6

I have two data frames, each with a column of names:

df1['name']   -> 3,000 rows

df2['name']   -> 64,000 rows

I am using fuzzywuzzy to find the best match in df2 for each df1 entry, using the following code:

from fuzzywuzzy import fuzz
from fuzzywuzzy import process

matches = [process.extract(x, df2['name'], limit=1) for x in df1['name']]

But this is taking forever to finish. Is there any faster way to do the fuzzy matching of strings in pandas?

Hariom Singh
kunal deep
  • Are the names unique? If not, you can speed it up by caching. Also, have you installed the python-levenshtein module? That speeds it up a lot (results may be slightly different). – Paulo Almeida Aug 16 '17 at 03:28
  • Hey @PauloAlmeida, yes, I have already installed the python-levenshtein module. The names are unique. – kunal deep Aug 16 '17 at 13:52
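
The caching idea from the first comment only pays off when names repeat, which is not the asker's case, but it can be sketched as follows. This is a minimal illustration under stated assumptions: `best_match`, `CHOICES` and the sample names are hypothetical, and the standard library's `difflib.SequenceMatcher` stands in for fuzzywuzzy's scorer so the snippet runs without extra packages.

```python
from difflib import SequenceMatcher
from functools import lru_cache

# Hypothetical sample data; in the question this would be the df2 name column.
CHOICES = ("Hong Kong", "Singapore", "New Delhi")


@lru_cache(maxsize=None)
def best_match(name):
    """Return (choice, score) for the choice most similar to `name`.

    SequenceMatcher is a stand-in scorer (an assumption of this sketch).
    Results are cached, so a repeated name is scored only once.
    """
    scored = ((c, SequenceMatcher(None, name, c).ratio()) for c in CHOICES)
    return max(scored, key=lambda pair: pair[1])


queries = ["Hong Kong", "Singapor", "Hong Kong"]  # "Hong Kong" appears twice
matches = [best_match(q) for q in queries]
print(matches)
print(best_match.cache_info())  # the repeated name is served from the cache
```

With unique names every lookup is a cache miss, so this adds nothing in that case.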

2 Answers

5

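One improvement I can see in your code is to use a generator expression: replace the square brackets with round brackets. Each match is then computed only when it is consumed, so the first results arrive sooner and the full list is never held in memory. Note that this does not reduce the total time if you end up consuming every match.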

matches = (process.extract(x, df2['name'], limit=1) for x in df1['name'])
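
What the generator changes is when the work happens, not how much of it there is. A small sketch of that behavior, assuming hypothetical sample names and using `difflib.SequenceMatcher` as a stand-in for fuzzywuzzy's scorer (the `best_match` helper and `scored_log` list are illustrative only):

```python
from difflib import SequenceMatcher
from itertools import islice

choices = ["Hong Kong", "Singapore", "New Delhi"]  # hypothetical sample data
queries = ["Hong Kon", "Singapor", "New Delhy"]
scored_log = []  # records each query at the moment it is actually scored


def best_match(name):
    """Return the most similar choice; SequenceMatcher is a stand-in scorer."""
    scored_log.append(name)
    return max(choices, key=lambda c: SequenceMatcher(None, name, c).ratio())


lazy = (best_match(q) for q in queries)  # nothing has been scored yet
first_two = list(islice(lazy, 2))        # scores only the first two queries
print(first_two, len(scored_log))

rest = list(lazy)                        # consuming the rest does the remaining work
print(rest, len(scored_log))
```

Consuming the whole generator performs exactly as many scoring calls as the list comprehension would.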

Edit: One more suggestion: the operation can be parallelized with the multiprocessing library.
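
A rough sketch of the multiprocessing suggestion, not a tuned implementation. The helper names `best_match` and `match_all`, the worker count, the chunk size and the sample data are all assumptions, and `difflib.SequenceMatcher` stands in for fuzzywuzzy's scorer so the snippet needs only the standard library:

```python
from difflib import SequenceMatcher
from functools import partial
from multiprocessing import Pool


def best_match(name, choices):
    """Return (choice, score) for the choice most similar to `name`.

    SequenceMatcher is a stand-in scorer (an assumption of this sketch).
    """
    scored = ((c, SequenceMatcher(None, name, c).ratio()) for c in choices)
    return max(scored, key=lambda pair: pair[1])


def match_all(names, choices, workers=2, chunksize=100):
    """Score every name in a pool of worker processes, preserving input order."""
    worker = partial(best_match, choices=list(choices))
    with Pool(processes=workers) as pool:
        # chunksize batches several names per task to cut inter-process overhead
        return pool.map(worker, names, chunksize)


if __name__ == "__main__":
    # Hypothetical sample data standing in for df1['name'] and df2['name'].
    names = ["Hong Kon", "Singapor"]
    choices = ["Hong Kong", "Singapore", "New Delhi"]
    print(match_all(names, choices))
```

Each worker receives its own copy of the choices, so the gain depends on the scoring time outweighing the cost of starting processes and passing data between them.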

StatguyUser
  • Glad I could be of help! – StatguyUser Aug 21 '17 at 16:34
  • @Enthusiast Could you please tell me how you appended the matches back to the DataFrame? My code is c = [process.extract(x, df1['Name'], limit=5) for x in df2['Name']] and it returns a list like the one below, which I need to append back to df1. [[(' Hong Kong', 100, 0)]] – Maneet Giri Nov 05 '18 at 10:31
  • @StatguyUser hi. How did you use this to index the matches within the dataframe? – Aquiles Páez Nov 14 '20 at 02:44
  • @AquilesPáez, convert it to a list and add it as a new column, although that defeats the purpose of using a generator in the first place. I haven't checked whether it can be added as a column directly without converting. – StatguyUser Nov 14 '20 at 09:14
  • Generator seems to be faster just because it is helping the execution to be more responsive by not waiting until the complete execution of the loop. But if you do need to generate all the matches will it be helpful in the long run? More twists here: https://stackoverflow.com/a/31766906/6907424 – hafiz031 Jul 28 '21 at 03:14
0

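You can use Python's multiprocessing package to speed it up by spreading the matching across CPU cores; plain threads are unlikely to help with CPU-bound work like this because of the global interpreter lock. Pandas doesn't use multiple cores by default.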