I have 2 dataframes df1 and df2 like this:
df1:
Id Name
1 Tuy Hòa
2 Kiến thụy
3 Bình Tân
df2:
code name
A1 Tuy Hoà
A2 Kiến Thụy
A3 Tân Bình
Now when I use merge:
out_df = pd.merge(df1, df2, left_on=['Name'], right_on=['name'], how='outer')
Of course the result is null, because example for Tuy Hòa
<> Tuy Hoà
(problem is the position sign "`"),...
Then I use fuzzy function and apply lambda but it's not correct at all :(
My function:
def checker(Name,names):
correct_name = ''
correct_ratio = 0
if Name in names:
ratio = 100
else:
try:
x = process.extractOne(Name, names, scorer=fuzz.token_set_ratio, score_cutoff = 80)
correct_name = x[0]
#correct_ratio = x[1]
except:
correct_name = ''
#correct_ratio = 0
return correct_name
Then I convert:
names = df2['name'].tolist()
After that I use apply to loop through, compare and output the correct (df1 must base on df2 to correct)
df1['Name'] = df1['Name'].apply(lambda x: checker(x,names) if not checker(x,names) else x)
The output now is:
df1
Id Name
1 Tuy Hòa
2 Kiến Thụy
3 Tân Bình
Seem like Tuy Hòa
and Kiến Thụy
are correct but Bình Tân
and Tân Bình
are different name! so changing position charaters in this case is correct not error in typing :( The expect output should be:
df1
Id Name
1 Tuy Hòa
2 Kiến Thụy
3 Bình Tân
I try to use fuzz.ratio
with score_cutoff = 55
but some case this will miss pairs
Any solution is better than my solution? (maybe not using fuzzy library?) Please help me. Many thanks to your help <3