17

I am trying to find potential matches in a pandas column full of organization names. I am currently using iterrows() but it is extremely slow on a dataframe with ~70,000 rows. After looking through StackOverflow I tried a row-wise apply() with a lambda, but that barely speeds things up, if at all.

The first four rows of the dataframe look like this:

index  org_name
0   cliftonlarsonallen llp minneapolis MN
1   loeb and troper llp newyork NY
2   dauby o'connor and zaleski llc carmel IN
3   wegner cpas llp madison WI

The following code block works but took around five days to process:

from fuzzywuzzy import process

org_list = df['org_name']
for index, row in df.iterrows():
    x = process.extract(row['org_name'], org_list, limit=2)[1]
    if x[1] > 93:
        df.loc[index, 'fuzzy_match'] = x[0]
        df.loc[index, 'fuzzy_match_score'] = x[1]

In effect, for each row I am comparing the organization name against the list of all organization names, taking the top two matches, then selecting the second-best match (because the top match will be the identical name), and then setting a condition that the score must be higher than 93 in order to create the new columns. The reason I'm creating additional columns is that I do not want to simply replace values -- I'd like to double-check the results first.

Is there a way to speed this up? I read several blog posts and StackOverflow questions that talked about 'vectorizing' this code, but my attempts at that failed. I also considered simply creating a 70,000 x 70,000 Levenshtein distance matrix and then extracting information from there. Is there a quicker way to generate the best match for each element in a list or pandas column?

Gregory Saxton

4 Answers

39

Given your task, you are comparing 70k strings with each other using fuzz.WRatio, so you have a total of 4,900,000,000 comparisons, with each of these comparisons using the Levenshtein distance inside fuzzywuzzy, which is an O(N*M) operation. fuzz.WRatio is a combination of multiple different string matching ratios that have different weights; it then selects the best ratio among them, so it even has to calculate the Levenshtein distance multiple times. One goal should therefore be to reduce the search space by excluding some possibilities using a much faster matching algorithm.

Another issue is that the strings are preprocessed to remove punctuation and to lowercase them. While this is required for the matching (so that e.g. an uppercase word becomes equal to a lowercase one), we can do it ahead of time, so we only have to preprocess the 70k strings once.

I will use RapidFuzz instead of FuzzyWuzzy here, since it is quite a bit faster (I am the author).

The following version performs more than 10 times as fast as your previous solution in my experiments and applies the following improvements:

  1. it preprocesses the strings ahead of time

  2. it passes a score_cutoff to extractOne so it can skip calculations where it already knows they cannot reach this ratio

import pandas as pd
from rapidfuzz import process, utils

org_list = df['org_name']
processed_orgs = [utils.default_process(org) for org in org_list]

for (i, processed_query) in enumerate(processed_orgs):
    # None is skipped by extractOne, so we set the current element to None and
    # revert this change after the comparison
    processed_orgs[i] = None
    match = process.extractOne(processed_query, processed_orgs, processor=None, score_cutoff=93)
    processed_orgs[i] = processed_query
    if match:
        df.loc[i, 'fuzzy_match'] = org_list[match[2]]
        df.loc[i, 'fuzzy_match_score'] = match[1]
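
As an aside on the matrix idea from the question: newer RapidFuzz releases (after this answer was written) ship process.cdist, which computes the full pairwise score matrix in C++ and across multiple cores. A minimal sketch, assuming a RapidFuzz version that includes it, and with the memory caveat noted in the comments:

import numpy as np
from rapidfuzz import process, fuzz, utils

org_list = df['org_name']
processed_orgs = [utils.default_process(org) for org in org_list]

# scores[i, j] = fuzz.WRatio(processed_orgs[i], processed_orgs[j]).
# Caveat: a 70k x 70k float32 matrix needs roughly 19.6 GB of RAM, so this
# only pays off for smaller lists or when run in chunks.
scores = process.cdist(processed_orgs, processed_orgs,
                       scorer=fuzz.WRatio, workers=-1)
np.fill_diagonal(scores, 0)  # ignore the trivial self-matches

best = scores.argmax(axis=1)
best_scores = scores[np.arange(len(best)), best]
mask = best_scores > 93
df.loc[mask, 'fuzzy_match'] = org_list.iloc[best[mask]].values
df.loc[mask, 'fuzzy_match_score'] = best_scores[mask]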

Here is a list of the most relevant improvements in RapidFuzz that make it faster than FuzzyWuzzy in this example:

  1. It is implemented fully in C++ while a big part of FuzzyWuzzy is implemented in Python

  2. When calculating the Levenshtein distance it takes the score_cutoff into account to choose an optimized implementation. E.g. when the length difference between the strings is too big, it can exit in O(1).

  3. FuzzyWuzzy uses Python-Levenshtein to calculate the similarity between two strings, which uses a weighted Levenshtein distance with a weight of 2 for substitutions, implemented using Wagner-Fischer. RapidFuzz on the other hand uses a bit-parallel implementation based on BitPal, which is faster

  4. fuzz.WRatio combines the results of multiple other string matching algorithms like fuzz.ratio, fuzz.token_sort_ratio and fuzz.token_set_ratio and takes the maximum result after weighting them. While fuzz.ratio has a weighting of 1, fuzz.token_sort_ratio and fuzz.token_set_ratio have one of 0.95. When the score_cutoff is bigger than 95, fuzz.token_sort_ratio and fuzz.token_set_ratio are not calculated anymore, since the results are guaranteed to be smaller than the score_cutoff (see the snippet after this list)

  5. In process.extractOne RapidFuzz avoids calls through Python whenever possible and preprocesses the query once ahead of time. E.g. the BitPal algorithm requires one of the two compared strings to be stored in a bit vector, which takes a big part of the algorithm's runtime. In process.extractOne the query is stored in this bit vector only once, and the bit vector is reused afterwards, making the algorithm a lot faster.

  6. since extractOne only searches for the best match, it uses the ratio of the current best match as score_cutoff for the next elements. This way it can quickly discard more elements by using the improvements to the Levenshtein distance calculation from 2) in many cases. When it finds an element with a similarity of 100 it exits early, since there can't be a better match afterwards.
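
As a minimal illustration of points 2) and 4), using two names from the question's sample data (not part of the original answer): RapidFuzz scorers accept score_cutoff directly, and anything below the cutoff is reported as 0.

from rapidfuzz import fuzz

a = "cliftonlarsonallen llp minneapolis MN"
b = "wegner cpas llp madison WI"

# Full WRatio computation, no shortcuts possible.
print(fuzz.WRatio(a, b))

# With a cutoff above 95 the token_sort/token_set ratios are skipped
# entirely, and large length differences allow an O(1) early exit;
# scores below the cutoff come back as 0.
print(fuzz.WRatio(a, b, score_cutoff=96))  # 0.0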

maxbachmann
  • I love the library! It took down the time for matching from 66 hours to about 90 minutes. – isht3 May 13 '20 at 07:36
  • This library is fantastic, I had similar performance improvements to the above. – tom Jun 04 '20 at 09:20
  • Great library. Just tried it on a problem and it reduced the time by 16X over fuzzywuzzy. – DarrylG Jul 24 '20 at 18:44
  • Does your library work with multidimensional scores? I.e. tuple-contained scores? FW does. I had a look at your sources, and it appears to deal with scalar scores exclusively, right? – keepAlive Aug 10 '20 at 23:42
  • Yes, it essentially resembles the behaviour of fuzzywuzzy, which should only work with scalar values (and key-value pairs like dictionaries or pandas Series) as well. I did not really have the requirement myself yet and especially do not really know how to perform string matching on matrices in a way that works for everyone (in case you have a good idea for this, feel free to open an issue on it ;)). – maxbachmann Aug 11 '20 at 00:01
  • Thx for your quick reply :) FuzzyWuzzy does work with tuple scores. Try, e.g., `fzp.extractOne(query='stack', choices=['stick', 'stok'], processor=None, scorer=lambda s1, s2: (fzz.WRatio(s1, s2), fzz.QRatio(s1, s2)), score_cutoff=(0, 0))`. This approach allows for lexicographical ordering of matches. Say two matches have the same first score, they may not have so with the second one. – keepAlive Aug 11 '20 at 09:24
  • Actually, the two sets of lines that prevent tuple scores are [***(170:172)***](https://github.com/maxbachmann/rapidfuzz/blob/master/src/rapidfuzz/process.py#L170) and [***(189:191)***](https://github.com/maxbachmann/rapidfuzz/blob/master/src/rapidfuzz/process.py#L189) in [process.py](https://github.com/maxbachmann/rapidfuzz/blob/master/src/rapidfuzz/process.py). When dealing with identical scores, fuzzywuzzy does simply return the first one it finds within (what appears to be) an unordered iterable... Admittedly debatable. – keepAlive Aug 11 '20 at 09:47
  • I agree with you that this usage should be allowed. I opened https://github.com/maxbachmann/rapidfuzz/issues/39 on this. – maxbachmann Aug 11 '20 at 11:07
  • @maxbachmann: Loved your rapidfuzz package. Just one query, I do have similar problem like above, but instead of 1st match..I want to put limit as 10 with threshold of > 80. How can I do with rapidfuzz? – Rahul Agarwal Sep 24 '20 at 05:50
  • @RahulAgarwal try `process.extract(query, choices, limit=10, score_cutoff=80)` – maxbachmann Sep 24 '20 at 07:40
  • Yes..thanks...saw your other answers and figured out!!! :) – Rahul Agarwal Sep 24 '20 at 07:44
  • hi @maxbachmann - can RapidFuzz be used in an AWS lambda? – JimmyStrings Aug 29 '22 at 22:25
  • I would expect rapidfuzz to work in AWS lambda, but I have never used it, so I can't tell for sure. – maxbachmann Aug 29 '22 at 22:54
  • Hi @maxbachmann. Could you also write a code snippet which can match from 2 different dataframes? – ranemak Feb 03 '23 at 14:52
4

This solution leverages apply() and should demonstrate reasonable performance improvements. Feel free to play around with the scorer and change the threshold to meet your needs:

import pandas as pd, numpy as np
from fuzzywuzzy import process, fuzz

df = pd.DataFrame([['cliftonlarsonallen llp minneapolis MN'],
                   ['loeb and troper llp newyork NY'],
                   ["dauby o'connor and zaleski llc carmel IN"],
                   ['wegner cpas llp madison WI']],
                  columns=['org_name'])

org_list = df['org_name']

threshold = 40

def find_match(x):
    # take the second-best match; the best match is the string itself
    match = process.extract(x, org_list, limit=2, scorer=fuzz.partial_token_sort_ratio)[1]
    match = match if match[1] > threshold else np.nan
    return match

df['match found'] = df['org_name'].apply(find_match)

Returns:

                                   org_name                                     match found
0     cliftonlarsonallen llp minneapolis MN             (wegner cpas llp madison WI, 50, 3)
1            loeb and troper llp newyork NY             (wegner cpas llp madison WI, 46, 3)
2  dauby o'connor and zaleski llc carmel IN                                             NaN
3                wegner cpas llp madison WI  (cliftonlarsonallen llp minneapolis MN, 50, 0)

If you would just like to return the matching string itself, then you can modify as follows:

match = match[0] if match[1]>threshold else np.nan

I've added @user3483203's comment pertaining to a list comprehension here as an alternative option as well:

df['match found'] = [find_match(row) for row in df['org_name']]

Note that process.extract() is designed to handle a single query string and apply the passed scoring algorithm to that query and the supplied match options. For that reason, you have to evaluate each query against all 70,000 match options (the way your code is currently set up), which means len(match_options)**2, or 4,900,000,000, string comparisons. I therefore think the best performance improvements can be achieved by limiting the potential match options via more extensive logic in the find_match() function, e.g. enforcing that the match options start with the same letter as the query (see the sketch below).
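
A minimal sketch of that first-letter idea (the bucketing code and the find_match_bucketed helper are illustrative additions, not tested against the full dataset):

from collections import defaultdict
import numpy as np
from fuzzywuzzy import process, fuzz

# Bucket the names by first letter, so each query is only compared
# against candidates that share it.
buckets = defaultdict(list)
for name in df['org_name']:
    buckets[name[0].lower()].append(name)

threshold = 40

def find_match_bucketed(x):
    candidates = buckets[x[0].lower()]
    # limit=2 so the identical string itself can be skipped
    matches = process.extract(x, candidates, limit=2,
                              scorer=fuzz.partial_token_sort_ratio)
    if len(matches) < 2:
        return np.nan
    match = matches[1]
    return match if match[1] > threshold else np.nan

df['match found'] = [find_match_bucketed(row) for row in df['org_name']]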

rahlf23
  • You're going to get better performance using a list comprehension than using apply. Something like `[find_match(row) for row in df.org_name]` – user3483203 Oct 03 '18 at 16:36
  • Thanks, rahlf23. I appreciate this. However, I mentioned that I had already tried a row apply approach with no performance increase. With your code applied to the first two rows it took 11.7 seconds, while with the iterrows approach it took 11.6 seconds. – Gregory Saxton Oct 03 '18 at 16:39
  • Are you really checking each query against all 70,000 match options? I'm sure you can find a means to limit your match options for each query – rahlf23 Oct 03 '18 at 16:42
  • Thank you, @user3483203. That seems promising, but could you please fill in the blanks for me about how I would use that to generate a new column with the match? – Gregory Saxton Oct 03 '18 at 16:42
  • @rahlf23 -- yes, that's why I'm asking for help here. Any suggestions? – Gregory Saxton Oct 03 '18 at 16:43
  • Well that's where your knowledge of your data is crucial. Can you limit your match options to start with the same letter as your query, as an example? – rahlf23 Oct 03 '18 at 16:45
  • Yes, thanks. That could work. I am still interested in whether there is a way to speed up the code with a vectorization approach or something similar. – Gregory Saxton Oct 03 '18 at 16:50
  • I don't believe there is, because `process.extract()` is designed to handle a single query string and apply the passed scoring algorithm to that query and the supplied match options. I will add some details to my answer in another edit. – rahlf23 Oct 03 '18 at 16:52
  • Thanks, @rahlf23. I hadn't thought of something as simple as limiting the match to first letters. That cut the time for 100 rows down from 11 minutes to 1 minute. – Gregory Saxton Oct 03 '18 at 17:10
  • Great! Just as a thought experiment, if you assume that your match options list contains an equal number of strings that start with each of the 26 letters in the alphabet, you could theoretically improve performance in that case by 26X. Like I said, you know the dataset better than I do, so I'm sure you can think of other ways to limit your match options and improve performance even more. – rahlf23 Oct 03 '18 at 17:13
  • Thanks again, and much appreciated. – Gregory Saxton Oct 03 '18 at 17:17
  • Have you tried a nested loop? For me, 3000 rows against 2000 rows comes in between 30-45 seconds. And I am also comparing organization names only – Rahul Agarwal Oct 09 '18 at 19:34
1

Using iterrows() is not recommended on dataframes; you could use apply() instead, but that probably wouldn't speed things up by much. What is slow is fuzzywuzzy's extract method, where your input is compared with all 70k rows (string distance methods are computationally expensive). So if you intend to stick with fuzzywuzzy, one solution would be to limit your search, for example to only those names with the same first letter, or to use another column in your data as a hint (state, city, ...), as sketched below.
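
A minimal sketch of the hint idea, assuming (as in the question's sample data) that the last token of org_name is a state code; best_match_in_state is a hypothetical helper:

import numpy as np
from fuzzywuzzy import process

# Assume the trailing token is the state code ("... minneapolis MN") and
# only compare names within the same state.
df['state'] = df['org_name'].str.rsplit(n=1).str[-1]
names_by_state = df.groupby('state')['org_name'].apply(list).to_dict()

def best_match_in_state(row, threshold=93):
    candidates = names_by_state[row['state']]
    matches = process.extract(row['org_name'], candidates, limit=2)
    if len(matches) < 2:
        return np.nan
    name, score = matches[1][:2]
    return name if score > threshold else np.nan

df['fuzzy_match'] = df.apply(best_match_in_state, axis=1)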

pedram bashiri
0

My initial approach was to use Spark on a cluster, or multiprocessing, to parallelize this. But we can also improve the algorithm implementation itself, as RapidFuzz does. The two approaches can be combined:

import multiprocessing
from multiprocessing import Pool

import pandas as pd
from rapidfuzz import process

# NOTE: choices (the list of names to match against) and checktokens (the
# queries) must be defined at module level so the worker processes can see
# them, e.g. choices = checktokens = df['org_name'].tolist()

def checker(wrong_option):
    # exact matches need no fuzzy search
    if wrong_option in choices:
        return wrong_option, wrong_option, 100
    else:
        x = process.extractOne(wrong_option, choices, processor=None, score_cutoff=0)
        return wrong_option, x[0], x[1]


if __name__ == '__main__':
    # set up one worker per CPU core
    pool = Pool(multiprocessing.cpu_count())
    print("cpu counts:" + str(multiprocessing.cpu_count()))

    # fuzzy-match all queries in parallel
    pool_outputs = pool.map(checker, checktokens)

    # create a DataFrame from the results
    df = pd.DataFrame(pool_outputs, columns=['Name', 'matched', 'Score'])

    # output
    print(df)
Liu Feng