Even if you install fuzzywuzzy, you are still left with the problem of how to choose a proper heuristic to select the right product and to cut those products which are matched incorrectly (explanation below).
Install fuzzywuzzy:
pip install fuzzywuzzy
fuzzywuzzy has several methods for ratio calculation (there are examples on GitHub). You face the problem: how to choose the best one? I tried them on your data, but all of them failed.
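For example, you can compare the scorers on a single pair taken from your data to get a feeling for how they behave (a quick sketch; the exact scores depend on the fuzzywuzzy version):
from fuzzywuzzy import fuzz

brand, product = 'Altmeister Bitter', '1960 Altmeister 330ML CAN METAL'
print(fuzz.ratio(brand, product))             # plain ratio over the whole strings
print(fuzz.partial_ratio(brand, product))     # best-matching substring
print(fuzz.token_sort_ratio(brand, product))  # tokens are sorted before comparing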
Code:
import pandas as pd
import numpy as np
from fuzzywuzzy import fuzz
# df1 = ...
# df2 = ...
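# To make the example self-contained, here is a sketch of the frames,
# reconstructed from the output further down -- substitute your real df1 and df2:
df1 = pd.DataFrame({'Brand_var': ['Altmeister Bitter',
                                  'Altos Las Hormigas Argentinian Wine',
                                  'Amadeus Contri Sparkling Wine',
                                  'Amadeus Cream Liqueur',
                                  'Amadeus Sparkling Sparkling Wine']})
df2 = pd.DataFrame({'Product': ['1960 Altmeister 330ML CAN METAL',
                                'Hormi 12 Yr Bottle',
                                'test']})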
def get_top_by_ratio(x, df2):
    product_values = df2.Product.values
    # compare two strings by characters
    ratio = np.array([fuzz.partial_ratio(x, val) for val in product_values])
    argmax = np.argmax(ratio)
    rating = ratio[argmax]
    linked_product = product_values[argmax]
    return rating, linked_product
Apply this function to your data:
partial_ratio = (df1.Brand_var.apply(lambda x: get_top_by_ratio(x, df2))
                 .apply(pd.Series)  # convert returned Series of tuples into pd.DataFrame
                 .rename(columns={0: 'ratio', 1: 'Product'}))  # just rename columns
print(partial_ratio)
Out:
   ratio                          Product
0     65  1960 Altmeister 330ML CAN METAL  # Altmeister Bitter
1     50                             test  # Altos Las Hormigas Argentinian Wine
2     33                             test
3     50                             test
4     50                             test
That's not good. Other ratio methods such as fuzz.ratio, fuzz.token_sort_ratio, etc. failed too.
So I guess extending the heuristic to compare words, not only characters, might help. Define a function that creates a vocabulary from your data, a function that encodes the sentences with it, and a more sophisticated heuristic that looks at words too:
def create_vocab(df1, df2):
    # leave index 0 free for unknown words
    all_words = set((df1.Brand_var.str.cat(sep=' ') + ' ' + df2.Product.str.cat(sep=' ')).split())
    vocab = dict((w, i + 1) for i, w in enumerate(all_words))
    return vocab

def encode(string, vocab):
    """Encode a string with the vocabulary (unknown words map to 0)."""
    return [vocab[w] if w in vocab else 0 for w in string.split()]
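Just to illustrate what these helpers return (the concrete ids are arbitrary because they depend on set ordering, so treat the printed values as a sketch):
vocab = create_vocab(df1, df2)
print(encode('Altmeister Bitter', vocab))  # a list of word ids, e.g. [17, 3]
print(encode('Some Unseen Words', vocab))  # words missing from the vocabulary become 0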
Define the new heuristic:
def get_top_with_heuristic(x, df2, vocab):
    product_values = df2.Product.values
    # compare two strings by characters
    ratio_per_char = np.array([fuzz.partial_ratio(x, val) for val in product_values])
    # compare two strings by words (encode both sides with the vocabulary)
    ratio_per_word = np.array([fuzz.partial_ratio(encode(x, vocab), encode(val, vocab))
                               for val in product_values])
    ratio = ratio_per_char + ratio_per_word
    argmax = np.argmax(ratio)
    rating = ratio[argmax]
    linked_product = product_values[argmax]
    return rating, linked_product
Create the vocabulary and apply the more sophisticated heuristic to the data:
vocab = create_vocab(df1, df2)
heuristic_rating = (df1.Brand_var.apply(lambda x: get_top_with_heuristic(x, df2, vocab))
                    .apply(pd.Series)
                    .rename(columns={0: 'ratio', 1: 'Product'}))
print(heuristic_rating)
Out:
   ratio                          Product
0     73  1960 Altmeister 330ML CAN METAL  # Altmeister Bitter
1     61               Hormi 12 Yr Bottle  # Altos Las Hormigas Argentinian Wine
2     45               Hormi 12 Yr Bottle
3     50                             test
4     50                             test
It seems to be correct! Concatenate this dataframe with df1 and change the index:
result_heuristic = pd.concat((df1, heuristic_rating), axis=1).set_index('Brand_var')
print(result_heuristic)
Out:
                                     ratio                          Product
Brand_var
Altmeister Bitter                       73  1960 Altmeister 330ML CAN METAL
Altos Las Hormigas Argentinian Wine     61               Hormi 12 Yr Bottle
Amadeus Contri Sparkling Wine           45               Hormi 12 Yr Bottle
Amadeus Cream Liqueur                   50                             test
Amadeus Sparkling Sparkling Wine        50                             test
Now you should choose some rule of thumb to cut the incorrect matches. For this example cutting at ratio <= 50 works well, but you will probably need some research to define the best heuristic and the right threshold, and you will get some errors anyway. Choose an acceptable error rate, e.g. 2%, 5%, ..., and improve your algorithm until you reach it (this task is similar to validating machine learning classification algorithms).
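For instance, one way to pick the threshold is to hand-label a small sample and measure the error rate for several candidate cut-offs. The labels dict and the error_rate helper below are hypothetical, just to sketch the idea:
# hypothetical hand-made labels: Brand_var -> true Product (None = no match in df2)
labels = {
    'Altmeister Bitter': '1960 Altmeister 330ML CAN METAL',
    'Amadeus Cream Liqueur': None,
}

def error_rate(result_heuristic, labels, threshold):
    """Share of labelled brands whose thresholded prediction disagrees with the label."""
    errors = 0
    for brand, true_product in labels.items():
        row = result_heuristic.loc[brand]
        predicted = row['Product'] if row['ratio'] > threshold else None
        errors += int(predicted != true_product)
    return errors / len(labels)

for threshold in (40, 50, 60, 70):
    print(threshold, error_rate(result_heuristic, labels, threshold))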
Cut incorrect "predictions":
result = result_heuristic[result_heuristic.ratio > 50][['Product']]
print(result)
Out:
                                                             Product
Brand_var
Altmeister Bitter                    1960 Altmeister 330ML CAN METAL
Altos Las Hormigas Argentinian Wine               Hormi 12 Yr Bottle
Hope it helps!
P.S. Of course, this algorithm is very, very slow; when you move beyond this prototype you should do some optimizations, for example, cache the diffs, etc.
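As a sketch of the caching idea (assuming the repeated pairwise comparisons are the bottleneck), you could memoize the ratio calls; installing python-Levenshtein (pip install python-Levenshtein) also speeds fuzzywuzzy up considerably:
from functools import lru_cache

from fuzzywuzzy import fuzz

@lru_cache(maxsize=None)
def cached_partial_ratio(a, b):
    # memoize the pairwise scores so each (a, b) pair is computed only once
    return fuzz.partial_ratio(a, b)
Then call cached_partial_ratio instead of fuzz.partial_ratio for the character-wise comparisons (the word-wise ones would need hashable arguments, e.g. tuples instead of lists).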