I have this code that matches a word from df2 to df1 column account_name and returns it's associated category. The code flow goes as follows:
>>> import pandas as pd
>>> data = {'account_name':['prepaid', 'postpaid', 'books', 'stationary','software','printer', 'mouse'], 'category':['admin','admin','admin','admin','it','it','it']}
>>> df1 = pd.DataFrame(data)
>>> df1
account_name category
0 prepaid admin
1 postpaid admin
2 books admin
3 stationary admin
4 software it
5 printer it
6 mouse it
>>> data2 = {'account_name':['stationary costs', 'prepaid expenses', 'postpaid expenses', 'mouse', 'software expenses']}
>>> df2 =pd.DataFrame(data2)
>>> df2
account_name
0 stationary costs
1 prepaid expenses
2 postpaid expenses
3 mouse
4 software expenses
>>> d = df1.set_index('account_name')['category'].to_dict()
>>> f = lambda x: next(iter(d[y] for y in x.split() if y in d))
>>> df2['category'] = df2['account_name'].apply(f)
>>> df2
account_name category
0 stationary costs admin
1 prepaid expenses admin
2 postpaid expenses admin
3 mouse it
4 software expenses it
This, however, has some flaws, it breaks down when there is a word in df2 that doesn't exactly have a match in df1['account_name'] for e.g. if I have a word 'rent' in df2['account_name'] it will break. Also, if I have a partial word in df2['account_name'] like "mous" instead of the full word "mouse" then it will also break.
What I would like to do is return the third column in df2 with match percentage which is basically based on the partial or full match word to word (more like fuzzy matching) and return the corresponding category if the match percentage is greater than 70% also a column with a match percentage.
Here's the expected output
>>> df2
account_name category match_perc
stationary costs admin 71
prepaid expenses admin 72
postpaid expenses admin 72
mouse it 100
software expenses it 73
chair n/a 0
table n/a 0
mous it 90