0

Having a really tough time with this one. Say I have two dataframes, one that has fruits and another one that has types of fruit candy. There's lots of other data in each of the dataframes. So it looks something like this:

fruit: 

   fruitId  fruitName
0     1     banana
1     2     orange
2     3     apple
3     4     pear
4     5     lemon

candy:

   candyId  candyName         fruitId
0     1     Orange Julius     null
1     2     Bananarama        null
2     3     Sour Lemon Drops  null
3     4     Chocolate Bar     null
4     5     Applicious        null

I need to match the candyName with the proper fruit, and then put the corresponding fruitId into the fruitId column in the candy dataframe. Let's assume for my purposes that .contains doesn't work at all; there too many creative spellings and outright misspellings in the candyName column.

I have tried to define a function that uses fuzzywuzzy, and then use that in .map, but I can't get the function to work. It needs to check each value of the first df to see if it's in the second, and then move onto the next value, etc. The functions I end up building keep wanting to do comparisons where they're either (a) in the same dataframe, or (b) in the same row.

I did find a solution to this, but it's ugly because it uses iterrows() which you're not supposed to use. Here it is:

import pandas as pd
from fuzzywuzzy import fuzz

candy_file = 'candy.csv'
fruit_file = 'fruits.csv'
candy = pd.read_csv(candy_file)
fruit = pd.read_csv(fruit_file)

dict = {}

for i, row1 in candy.iterrows():
    for j, row2 in fruit.iterrows():
        if fuzz.partial_ratio(row1['candyName'], row2['fruitName']) >= 80:
            dict[row1['candyName']] = row2['fruitId']

candy['fruitId'] = candy['candyName'].map(dict)

This takes forever. Like, 10 minutes to get through 500 rows. Is there a better way to do this? I've written like a hundred different code snippets out for faster functions without getting anywhere.

Thanks!

EnsAndBees
  • 73
  • 9

1 Answers1

1

It's slow because you're currently working in O(N^2).

Rather than using iterrows, use dictionaries to iterate instead. This can be done with the following:

candydict = candy.to_dict{}
fruitdict = fruit.to_dict{}

for k,v in candydict.items():
   for k2,v2 in fruitdict.items():
      #do the rest of your comparisons here

This should speed it up significantly.

ldren
  • 159
  • 5
  • This is great, thank you! How would I reference the columns, then? `v['candyName']` throws KeyError. – EnsAndBees Aug 03 '20 at 04:45
  • k,v stands for keys and values, so 'candyName' would be one of the k's. let me know if you need more clarification! – ldren Aug 03 '20 at 05:30
  • Thanks so much for your help so far. If I do `k['candyName']` I get `TypeError: string indices must be integers`. What am I missing? – EnsAndBees Aug 03 '20 at 17:16
  • I got it. I had to use `for k, v in candydict['candyName'].items()`. Thanks again! – EnsAndBees Aug 03 '20 at 20:15