Having a really tough time with this one. Say I have two dataframes, one that has fruits and another one that has types of fruit candy. There's lots of other data in each of the dataframes. So it looks something like this:
fruit:
   fruitId fruitName
0        1    banana
1        2    orange
2        3     apple
3        4      pear
4        5     lemon

candy:
   candyId         candyName fruitId
0        1     Orange Julius    null
1        2        Bananarama    null
2        3  Sour Lemon Drops    null
3        4     Chocolate Bar    null
4        5        Applicious    null
I need to match each candyName with the proper fruit, and then put the corresponding fruitId into the fruitId column of the candy dataframe. Let's assume for my purposes that .contains doesn't work at all; there are too many creative spellings and outright misspellings in the candyName column.
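So the end result should look something like this (I'd expect Chocolate Bar to stay null, since it doesn't correspond to any fruit):

   candyId         candyName fruitId
0        1     Orange Julius       2
1        2        Bananarama       1
2        3  Sour Lemon Drops       5
3        4     Chocolate Bar    null
4        5        Applicious       3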
I have tried to define a function that uses fuzzywuzzy and then apply it with .map, but I can't get the function to work. It needs to take each value in one dataframe, check it against the values in the other dataframe, and then move on to the next value, and so on. The functions I end up building keep doing comparisons that are either (a) within the same dataframe or (b) within the same row.
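Conceptually, this sketch is the shape I've been aiming for, using fuzzywuzzy's process.extractOne to pick the best fruit name for each candy name (best_fruit_id and the 80 cutoff are just placeholders I made up, not code I've gotten working):

from fuzzywuzzy import fuzz, process

# Lookup from fruit name to its id, built once from the fruit dataframe
fruit_ids = dict(zip(fruit['fruitName'], fruit['fruitId']))

def best_fruit_id(candy_name, threshold=80):
    # extractOne scores candy_name against every fruit name and returns
    # (best_match, score), or None if nothing reaches score_cutoff
    match = process.extractOne(candy_name, list(fruit_ids.keys()),
                               scorer=fuzz.partial_ratio,
                               score_cutoff=threshold)
    return fruit_ids[match[0]] if match else None

candy['fruitId'] = candy['candyName'].map(best_fruit_id)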
I did find a solution to this, but it's ugly because it relies on iterrows(), which you're not supposed to use. Here it is:
import pandas as pd
from fuzzywuzzy import fuzz

candy_file = 'candy.csv'
fruit_file = 'fruits.csv'
candy = pd.read_csv(candy_file)
fruit = pd.read_csv(fruit_file)

# candyName -> fruitId lookup (renamed from 'dict', which shadows the builtin)
fruit_id_by_candy = {}

# Compare every candy row against every fruit row
for i, row1 in candy.iterrows():
    for j, row2 in fruit.iterrows():
        if fuzz.partial_ratio(row1['candyName'], row2['fruitName']) >= 80:
            fruit_id_by_candy[row1['candyName']] = row2['fruitId']

candy['fruitId'] = candy['candyName'].map(fruit_id_by_candy)
This takes forever. Like, 10 minutes to get through 500 rows. Is there a better way to do this? I've written what feels like a hundred different snippets trying to build a faster function, without getting anywhere.
Thanks!