Python: Keepning only the outerloop max result when comparing string similarity of two lists

Question

I have two table with an unequal amount of columns but with the same order, lets call the old and new. old has more columns than new than new.

The difference between them is that the spelling has changed as in spaces get replaced by _ and names get shortened from ex item name to item.

Ex:

old=['Item number','Item name', 'Item status', 'Stock volume EUR','Stock volume USD', 'Location']

new=['Item_number','Item', 'Item_status','Stock volume EUR', 'Location']

In reality if have a 50 column long list and 4 columns less in the new list.

Currently i have made list of the column headers and applied the levenshtein distance divided by sting length through a nested loop to find the most similar strings.

My next step i assume is change the nested loop in order to only keep the max result for each outer loop but i do not know how to go about that or if that is the right step.

distance=[jellyfish.levenshtein_distance(x,y)/len(x)for x in a for y in b

I want to use the new column headers on the old list and remove the columns that have no match in the new table

There are a few things you need to consider first. 1) What is going to be your threshold for the `levenshtein_distance` to say that two column names are close enough? 2) Is it possible to have multiple mappings of a column due to equal `levenshtein_distance` values? — cwalvoort, Jun 02 '19 at 12:38
as i am dividing with sting lenght i am getting unique results (i think) but for some strings i get 2 matches and i only want the larger of the two. — onehouse, Jun 02 '19 at 13:01

Neel · Answer 1 · 2019-06-02T18:24:15.653

I am here to suggest a different approach altogether. Since you are using Levenshtein distance, I suggest using the fuzzywuzzy package that has a faster implementation of the same, as well as some ready-made methods that will perfectly fit your purpose.

from fuzzywuzzy import process
from fuzzywuzzy.fuzz import ratio
old=['Item number','Item name', 'Item status', 'Stock volume EUR',
     'Stock volume USD', 'Location']

new=['Item_number','Item', 'Item_status','Stock volume EUR', 'Location']

mapper = [(new_hdr,process.extractOne(new_hdr,old,scorer=ratio)[0]) for new_hdr in new]
df = df[[i[1] for i in mapper]]
df.columns = [i[0] for i in mapper]

The solution is much more precise in terms of coding and readability. However, depending on the exact strings, the extractOne method may fail to identify the correct map for all cases. Check if that is happening or not. Check if some such cases are happening or not. Accordingly we may have to customize the scorer argument in extractOne

I presumed that you are using the lists to filter and rename columns of a pd dataframe called df. If thats not the case, feel free to ignore last two lines. the `mapper` is basically the map of the new list items with the old one — Neel, Jun 02 '19 at 18:28

Python: Keepning only the outerloop max result when comparing string similarity of two lists

1 Answers1

Linked