Even if you install fuzzywuzzy, you are still left with the problem of how to choose a proper heuristic to select the right product and to cut those products which are matched incorrectly (explanation below).
Install fuzzywuzzy:
pip install fuzzywuzzy
fuzzywuzzy has several methods for ratio calculation (there are examples on GitHub). You face the problem: how to choose the best one? I tried them on your data, but all of them failed.
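For example, you can compare the scorers on a single pair taken from your data to get a feeling for how they behave (a quick sketch; the exact scores depend on the fuzzywuzzy version):
from fuzzywuzzy import fuzz

brand, product = 'Altmeister Bitter', '1960 Altmeister 330ML CAN METAL'
print(fuzz.ratio(brand, product))             # plain ratio over the whole strings
print(fuzz.partial_ratio(brand, product))     # best-matching substring
print(fuzz.token_sort_ratio(brand, product))  # tokens are sorted before comparing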
Code:
import pandas as pd
import numpy as np
from fuzzywuzzy import fuzz
# df1 = ...
# df2 = ...
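# To make the example self-contained, here is a sketch of the frames,
# reconstructed from the output further down -- substitute your real df1 and df2:
df1 = pd.DataFrame({'Brand_var': ['Altmeister Bitter',
                                  'Altos Las Hormigas Argentinian Wine',
                                  'Amadeus Contri Sparkling Wine',
                                  'Amadeus Cream Liqueur',
                                  'Amadeus Sparkling Sparkling Wine']})
df2 = pd.DataFrame({'Product': ['1960 Altmeister 330ML CAN METAL',
                                'Hormi 12 Yr Bottle',
                                'test']})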
def get_top_by_ratio(x, df2):
    product_values = df2.Product.values
    # compare two strings by characters
    ratio = np.array([fuzz.partial_ratio(x, val) for val in product_values])
    argmax = np.argmax(ratio)
    rating = ratio[argmax]
    linked_product = product_values[argmax]
    return rating, linked_product
Apply this function to your data:
partial_ratio = (df1.Brand_var.apply(lambda x: get_top_by_ratio(x, df2))
                 .apply(pd.Series)  # convert returned Series of tuples into pd.DataFrame
                 .rename(columns={0: 'ratio', 1: 'Product'}))  # just rename columns
print(partial_ratio)
Out:
   ratio                          Product
0     65  1960 Altmeister 330ML CAN METAL  # Altmeister Bitter
1     50                             test  # Altos Las Hormigas Argentinian Wine
2     33                             test
3     50                             test
4     50                             test
That's not good. Other ratio methods such as fuzz.ratio, fuzz.token_sort_ratio, etc. failed too.
So I guess extending the heuristic to compare words, not only characters, might help. Define a function that creates a vocabulary from your data, a function that encodes the sentences with it, and a more sophisticated heuristic that looks at words too:
def create_vocab(df1, df2):
    # leave index 0 free for unknown words
    all_words = set((df1.Brand_var.str.cat(sep=' ') + ' ' + df2.Product.str.cat(sep=' ')).split())
    vocab = dict((w, i + 1) for i, w in enumerate(all_words))
    return vocab

def encode(string, vocab):
    """Encode a string with the vocabulary (unknown words map to 0)."""
    return [vocab[w] if w in vocab else 0 for w in string.split()]
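Just to illustrate what these helpers return (the concrete ids are arbitrary because they depend on set ordering, so treat the printed values as a sketch):
vocab = create_vocab(df1, df2)
print(encode('Altmeister Bitter', vocab))  # a list of word ids, e.g. [17, 3]
print(encode('Some Unseen Words', vocab))  # words missing from the vocabulary become 0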
Define the new heuristic:
def get_top_with_heuristic(x, df2, vocab):
    product_values = df2.Product.values
    # compare two strings by characters
    ratio_per_char = np.array([fuzz.partial_ratio(x, val) for val in product_values])
    # compare two strings by words (encode both sides with the vocabulary)
    ratio_per_word = np.array([fuzz.partial_ratio(encode(x, vocab), encode(val, vocab))
                               for val in product_values])
    ratio = ratio_per_char + ratio_per_word
    argmax = np.argmax(ratio)
    rating = ratio[argmax]
    linked_product = product_values[argmax]
    return rating, linked_product
Create the vocabulary and apply the more sophisticated heuristic to the data:
vocab = create_vocab(df1, df2)
heuristic_rating = (df1.Brand_var.apply(lambda x: get_top_with_heuristic(x, df2, vocab))
                    .apply(pd.Series)
                    .rename(columns={0: 'ratio', 1: 'Product'}))
print(heuristic_rating)
Out:
   ratio                          Product
0     73  1960 Altmeister 330ML CAN METAL  # Altmeister Bitter
1     61               Hormi 12 Yr Bottle  # Altos Las Hormigas Argentinian Wine
2     45               Hormi 12 Yr Bottle
3     50                             test
4     50                             test
It seems to be correct! Concatenate this dataframe with df1 and change the index:
result_heuristic = pd.concat((df1, heuristic_rating), axis=1).set_index('Brand_var')
print(result_heuristic)
Out:
                                     ratio                          Product
Brand_var
Altmeister Bitter                       73  1960 Altmeister 330ML CAN METAL
Altos Las Hormigas Argentinian Wine     61               Hormi 12 Yr Bottle
Amadeus Contri Sparkling Wine           45               Hormi 12 Yr Bottle
Amadeus Cream Liqueur                   50                             test
Amadeus Sparkling Sparkling Wine        50                             test
Now you should choose some rule of thumb to cut the incorrect matches. For this example cutting at ratio <= 50 works well, but you will probably need some research to define the best heuristic and the right threshold, and you will get some errors anyway. Choose an acceptable error rate, e.g. 2%, 5%, ..., and improve your algorithm until you reach it (this task is similar to validating machine learning classification algorithms).
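For instance, one way to pick the threshold is to hand-label a small sample and measure the error rate for several candidate cut-offs. The labels dict and the error_rate helper below are hypothetical, just to sketch the idea:
# hypothetical hand-made labels: Brand_var -> true Product (None = no match in df2)
labels = {
    'Altmeister Bitter': '1960 Altmeister 330ML CAN METAL',
    'Amadeus Cream Liqueur': None,
}

def error_rate(result_heuristic, labels, threshold):
    """Share of labelled brands whose thresholded prediction disagrees with the label."""
    errors = 0
    for brand, true_product in labels.items():
        row = result_heuristic.loc[brand]
        predicted = row['Product'] if row['ratio'] > threshold else None
        errors += int(predicted != true_product)
    return errors / len(labels)

for threshold in (40, 50, 60, 70):
    print(threshold, error_rate(result_heuristic, labels, threshold))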
Cut incorrect "predictions":
result = result_heuristic[result_heuristic.ratio > 50][['Product']]
print(result)
Out:
                                                             Product
Brand_var
Altmeister Bitter                    1960 Altmeister 330ML CAN METAL
Altos Las Hormigas Argentinian Wine               Hormi 12 Yr Bottle
Hope it helps!
P.S. Of course, this algorithm is very, very slow; when you move beyond this prototype you should do some optimizations, for example, cache the diffs, etc.
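As a sketch of the caching idea (assuming the repeated pairwise comparisons are the bottleneck), you could memoize the ratio calls; installing python-Levenshtein (pip install python-Levenshtein) also speeds fuzzywuzzy up considerably:
from functools import lru_cache

from fuzzywuzzy import fuzz

@lru_cache(maxsize=None)
def cached_partial_ratio(a, b):
    # memoize the pairwise scores so each (a, b) pair is computed only once
    return fuzz.partial_ratio(a, b)
Then call cached_partial_ratio instead of fuzz.partial_ratio for the character-wise comparisons (the word-wise ones would need hashable arguments, e.g. tuples instead of lists).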