I have a list of approx. 150 mineral names, some of which don't quite match their equivalents in an approved list of several thousand mineral names (e.g. I may have an entry 'Amphibole(Barroisite)' rather than the accepted 'Barroisite').

I need a list that comprises the ~150 approved equivalent mineral names. I think the way to go about this is to use a list comprehension to generate a new list from partial matches between entries in the two lists, but I just can't get anything to work. I have previously checked questions such as "Partial String match between two lists in python" but have had no luck.

Examples of entries from my list and the approved list below:

approved_list = ['Aegirine','Barroisite','Cuprite','Pyrope','Rosasite','Traskite','Vaesite']

my_list = ['Pyroxene(Aegirine)','Amphibole(Barroisite)','Cuprite','Garnet(Pyrope)', 'Rosasite']

In the above example I would ideally generate a list comprising Aegirine, Barroisite, Cuprite, Pyrope, and Rosasite. The solution would also need to be flexible (e.g. it can't rely on the position of brackets), as the entries differ from their approved equivalents in a number of ways.

Thanks in advance for any ideas!

geolguy
  • Does this answer your question? [how to 'fuzzy' match strings when merge two dataframe in pandas](https://stackoverflow.com/questions/49120364/how-to-fuzzy-match-strings-when-merge-two-dataframe-in-pandas) – BeRT2me Aug 01 '22 at 05:39
  • Do your strings always follow the form `name` or `name(other_name)`? I.e. with no spaces outside the names themselves or any other characters? Or is there more variation? Also do you need a solution for actual Python lists, or is your data in some other format? – Grismar Aug 01 '22 at 05:46
  • @BeRT2me I'm afraid not as I'm looking to retrieve a list of ~150 entries rather than merge the two lists, but thank you for introducing me to fuzzywuzzy! That may be another way to look at the issue. – geolguy Aug 01 '22 at 07:39
  • @Grismar Not necessarily. Many do follow 'name(other_name)' but others may include character differences (e.g. character vs character with umlaut). The mineral lists are derived from dataframe column headers; I tend to modify lists and then reassign as column headers. My actual mineral lists are as the examples, just much longer! – geolguy Aug 01 '22 at 07:42

1 Answer

It's hard to provide a complete answer when the requirements are this vague. You'd have to specify more clearly which variations are possible.

But here is an example that ignores capitalisation, extra or missing diacritics (such as an umlaut, assuming the characters would be the same without diacritics, i.e. ä -> a and not ä -> ae), and stray whitespace:

import unicodedata


def strip_diacritics(s):
    return ''.join(
        # break down into characters after normalising:
        c for c in unicodedata.normalize('NFD', s)  
        # not a non-spacing mark:
        if unicodedata.category(c) != 'Mn'  
    )


approved_list = ['Aegirine', 'Barroisite', 'Cuprite', 'Pyrope', 'Rosasite', 'Traskite', 'Vaesite']

my_list = ['Pyroxene(Aegirine)', 'Amphibole(Barroïsite)', 'cuprite', 'Garnet (Pyrope)', 'Rosasite ']

# create a quick lookup from normalised name to desired name
approved_dict = {strip_diacritics(name).strip().lower(): name for name in approved_list}

new_list = [
    next(name for key, name in approved_dict.items()
         if key in strip_diacritics(test).strip().lower())
    for test in my_list
]

print(new_list)

Note how I introduced some problems into my_list (a diacritic, a lower-case name, a space before the bracket, and a trailing space) and how that doesn't affect the outcome. Output:

['Aegirine', 'Barroisite', 'Cuprite', 'Pyrope', 'Rosasite']
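
The example above stops with a StopIteration if an entry has no approved counterpart, and with several thousand approved names a short name that happens to sit inside a longer one could win the substring test. A minimal sketch of one way to guard against both follows. It is a variation on the idea, not a definitive implementation: `normalise`, `match_approved`, and the `cutoff` value of 0.8 are names and choices assumed here for illustration, the fuzzy fallback swaps in the standard library's `difflib` rather than fuzzywuzzy, and 'Actinolite' / 'Ferro-actinolite' are only there to demonstrate the longest-first ordering.

```python
import difflib
import unicodedata


def normalise(s):
    # Same normalisation idea as above: drop combining marks, trim, lower-case.
    decomposed = unicodedata.normalize('NFD', s)
    stripped = ''.join(c for c in decomposed if unicodedata.category(c) != 'Mn')
    return stripped.strip().lower()


def match_approved(name, approved_names, cutoff=0.8):
    """Return the approved name matching `name`, or None if nothing fits.

    `cutoff` is an assumed similarity threshold for the fuzzy fallback.
    """
    lookup = {normalise(a): a for a in approved_names}
    target = normalise(name)

    # 1. Substring match, longest key first, so that a short approved name
    #    cannot shadow a longer approved name that contains it.
    for key in sorted(lookup, key=len, reverse=True):
        if key in target:
            return lookup[key]

    # 2. Fuzzy fallback on the whole string, for small spelling differences.
    close = difflib.get_close_matches(target, list(lookup), n=1, cutoff=cutoff)
    return lookup[close[0]] if close else None


approved_list = ['Aegirine', 'Barroisite', 'Cuprite', 'Pyrope', 'Rosasite', 'Traskite', 'Vaesite']
my_list = ['Pyroxene(Aegirine)', 'Amphibole(Barroïsite)', 'cuprite', 'Garnet (Pyrope)', 'Rosasite ',
           'Vaesit', 'Quartz']

new_list = [match_approved(test, approved_list) for test in my_list]
unmatched = [test for test, found in zip(my_list, new_list) if found is None]

print(new_list)
print(unmatched)
```

With this version 'Vaesit' is picked up by the fuzzy step and 'Quartz' ends up in `unmatched` instead of raising, so the leftovers can be checked by hand. Rebuilding `lookup` on every call keeps the sketch short; for a long approved list it would be worth building it once outside the function.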
Grismar