
I have a semi-unique problem and no clue where to start. I'm using Python.

I'm trying to get a bunch of info about items from two APIs, and these APIs identify items in two different ways:

Name and ID

The Name will look something like: Helmet of Divan

The ID will look like: DIVAN_HELMET

It is easy for me to connect the two in a dictionary. My problem is that sometimes the names have suffixes and prefixes, such as:

Wise Helmet of Divan or Clean Helmet of Divan, or they even contain Unicode characters, like ✪ Helmet of Divan ✪.

I want to get the ID DIVAN_HELMET from these names, but I can't know how many characters long the prefix is, or even whether there is a suffix/prefix at all. I need to do this in bulk for over 3 thousand items with dozens of suffixes and prefixes.

DarkKnight
  • I have no idea what "semi unique" means. Something is either unique or it isn't. Also, you forgot to post the code you're struggling with https://stackoverflow.com/help/minimal-reproducible-example – DarkKnight Jul 23 '23 at 06:50
  • If the base name is always present, then you can use a loop with `if looking_for in new_name:`. It's going to take time – Tim Roberts Jul 23 '23 at 06:54
  • `"Helmet of Divan" in "Wise Helmet of Divan"` evaluates to True if the first string is part of the second. Does that solve your problem? – dath.vg Jul 23 '23 at 06:55
  • Have you tried regex? It will be better if you share your code – Priyanshu Jul 23 '23 at 06:57

1 Answer


So you want to get output like this: DIVAN_HELMET

From inputs like these: Wise Helmet of Divan or Clean Helmet of Divan or ✪ Helmet of Divan ✪

First, you can remove all non-ASCII characters, for example by keeping only printable ASCII:

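import string

str_input = "✪ Helmet of Divan ✪"  # example input from the question
printable = set(string.printable)  # printable ASCII characters
# Keep only the characters that are printable ASCII
str_input = ''.join(filter(lambda x: x in printable, str_input))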

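Then convert the string to lowercase: str_input = str_input.lower()

Then you need to tokenize the input. The easiest way is to split it on whitespace, e.g.: arr_str_input = str_input.split(). Calling split() with no argument also discards the empty strings that split(" ") would produce from the spaces left behind where the ✪ characters were removed.

Then you need to remove stopwords like 'of' or 'the'. For this step you can use a publicly available stopword list, or just hardcode the removal of 'of' if that is the only stopword in your input text, e.g.: arr_str_input = [w for w in arr_str_input if w != "of"]. This is safer than arr_str_input.remove("of"), which raises ValueError when the name contains no 'of'.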

Then you need to remove the prefix or suffix. For this step you can supply the list of all prefixes/suffixes yourself or use a ready-made one (be careful, since it can be a very big list).

After all that, you should have a list of only two words, like ['helmet', 'divan']. The last step is just to arrange them and make them uppercase, e.g.:
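The prefix/suffix step could be sketched roughly as follows, assuming you maintain the list of modifier words yourself. The names KNOWN_MODIFIERS and strip_modifiers are made up for this illustration, and only "wise" and "clean" come from the question, so the set will need more entries for real data.

```python
# Hypothetical set of modifier words, in lowercase. Only "wise" and "clean"
# come from the question; extend this from your own data.
KNOWN_MODIFIERS = {"wise", "clean"}


def strip_modifiers(tokens):
    """Drop known modifier words from the start and end of a token list."""
    tokens = list(tokens)  # work on a copy so the caller's list is untouched
    while tokens and tokens[0] in KNOWN_MODIFIERS:
        tokens.pop(0)
    while tokens and tokens[-1] in KNOWN_MODIFIERS:
        tokens.pop()
    return tokens


print(strip_modifiers(["wise", "helmet", "divan"]))  # ['helmet', 'divan']
```

Stripping only from the two ends means a modifier word that happens to be part of a base name, in the middle of it, is left alone.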

result = ['helmet','divan']
result.reverse()
print('_'.join(result).upper())
# outputs DIVAN_HELMET
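
Putting the steps together, a minimal sketch could look like the following. The function name name_to_id and the two word sets are assumptions made for illustration; the sets only cover the examples from the question. Note also that reversing the words only fits names of the form "X of Y", so it is worth checking each result against the IDs you already have in your dictionary.

```python
import string

PRINTABLE = set(string.printable)    # printable ASCII characters
STOPWORDS = {"of", "the"}            # assumption: extend as needed
KNOWN_MODIFIERS = {"wise", "clean"}  # assumption: only the question's examples


def name_to_id(name):
    """Turn a display name such as 'Wise Helmet of Divan' into 'DIVAN_HELMET'."""
    # 1. Drop non-ASCII characters such as the star symbol.
    ascii_only = "".join(ch for ch in name if ch in PRINTABLE)
    # 2. Lowercase and split on whitespace (split() skips empty tokens).
    tokens = ascii_only.lower().split()
    # 3. Remove stopwords and known prefix/suffix words.
    tokens = [t for t in tokens if t not in STOPWORDS]
    tokens = [t for t in tokens if t not in KNOWN_MODIFIERS]
    # 4. Reverse the remaining words and join them in uppercase.
    tokens.reverse()
    return "_".join(tokens).upper()


for name in ("Helmet of Divan", "Wise Helmet of Divan",
             "Clean Helmet of Divan", "✪ Helmet of Divan ✪"):
    print(name_to_id(name))  # DIVAN_HELMET each time
```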
Kristian