
So I am looking to modify this code to reduce the runtime of the fuzzywuzzy matching. At present, it takes about an hour for a dataset with 800 rows, and when I used it on a dataset with 4.5K rows it kept running for almost 6 hours with still no result, so I had to stop the kernel.

I need to use this code on a dataset of at least 20K rows. Can anyone please suggest any edits to this code to get the results faster? This is the code -

import pandas as pd
import numpy as np
from fuzzywuzzy import fuzz,process

df = pd.read_csv(r'path')
df.head()

data = df['Body']
print(data)

clean = []
threshold = 80 
for row in data:
  # score each sentence against each other
  # [('string', score),..]
  scores = process.extract(row, data, scorer=fuzz.token_set_ratio)
  # basic idea is if there is a close second match we want to evaluate 
  # and keep the longer of the two
  if scores[1][1] > threshold:
     clean.append(max([x[0] for x in scores[:2]],key=len))
  else:
     clean.append(scores[0][0])

# remove dupes
clean = set(clean)

#converting 'clean' list to dataframe and giving the column name for the cleaned column
clean_data = pd.DataFrame(clean, columns=['Body'])

clean_data.to_csv(r'path') 

This is what my data looks like -

https://docs.google.com/spreadsheets/d/1p9RC9HznhdJFH4kFYdE_TgnHdoRf8P6gTEAkB3lQWEE/edit?usp=sharing

So if you notice, rows 14 & 15 and rows 19 & 20 are partial duplicates; I want the code to identify such sentences and drop the shorter ones.
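For anyone unfamiliar with the scorer, here is a minimal illustration (made-up sentences, not from my data) of why fuzz.token_set_ratio gives such partial duplicates a very high score, which is what the threshold of 80 in the loop above is meant to catch:

from fuzzywuzzy import fuzz

# One sentence's tokens are a subset of the other's, so the
# token-set comparison treats them as a (near-)perfect match.
short_sentence = "The delivery was delayed by two days"
long_sentence = "The delivery was delayed by two days due to bad weather"

print(fuzz.token_set_ratio(short_sentence, long_sentence))   # 100 for a full token subset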

Update -

I made a minor change to the rapidfuzz solution given by @DarrylG, and now the code looks like this -

import pandas as pd
import numpy as np
import openpyxl
from rapidfuzz.fuzz import token_set_ratio as rapid_token_set_ratio
from rapidfuzz import process as process_rapid
from rapidfuzz import utils as rapid_utils
import time

df = pd.read_excel(r'path')

data = df['Body']
print(data)

def excel_sheet_to_dataframe(path):
    '''
        Loads sheet from Excel workbook using openpyxl
    '''
    wb = openpyxl.load_workbook(path)
    ws = wb.active
    data = ws.values
     # Get the first line in file as a header line
    columns = next(data)[0:]
    
    return pd.DataFrame(data, columns=columns)


clean_rapid = []
threshold = 80 

def process_rapid_fuzz(data):
    '''
        Process using rapid fuzz rather than fuzz_wuzzy
    '''
    series = (rapid_utils.default_process(d) for d in data)       # Pre-process to make lower-case and remove non-alphanumeric 
                                                                   # characters (generator)
    processed_data = pd.Series(series)   

    for query in processed_data:
        scores = process_rapid.extract(query, processed_data, scorer=rapid_token_set_ratio, score_cutoff=threshold)
        if len(scores) > 1 and scores[1][1] > threshold:
            m = max(scores[:2], key=lambda k: len(k[0]))                 # Of up to two matches above threshold, take the longest
            clean_rapid.append(m[0])                                     # Saving the matched (pre-processed) string
        else:
            clean_rapid.append(query)

################ Testing
t0 = time.time()
df = excel_sheet_to_dataframe(r'path')   # Using Excel file in working folder

# Desired data in body column
data = df['Body'].dropna()                                           # Dropping None rows (few None rows at end after Excel import)

result_fuzzy_rapid = process_rapid_fuzz(data)
print(f'Elapsed time {time.time() - t0}')

# remove dupes
clean_rapid = set(clean_rapid)

#converting 'clean' list to dataframe and giving the column name for the cleaned column
clean_data = pd.DataFrame(clean_rapid, columns=['Body'])

#exporting the cleaned data
clean_data.to_excel(r'path')

Now the issue is that in the output file all the full stops, etc. are getting dropped. How can I retain them?

Shrumo
  • Can you provide a small extract from your CSV file? – AndyKnight Jul 22 '21 at 11:15
  • Check out the answer by maxbachmann in [Vectorizing or Speeding up Fuzzywuzzy String Matching on PANDAS Column](https://stackoverflow.com/questions/52631291/vectorizing-or-speeding-up-fuzzywuzzy-string-matching-on-pandas-column) which produced 10X improvement for a similar problem. – DarrylG Jul 22 '21 at 11:31
  • @AndyKnight Hi, I have added a snippet of what my data looks like. Hope it helps – Shrumo Jul 22 '21 at 12:25
  • @Shrumo--isn't the reference just using a single column of one data frame (i.e. column 'org_name'). Doesn't it use fuzzy_wuzzy to find the closest matches of each row in the entire column? This seems similar to what you're doing. – DarrylG Jul 22 '21 at 12:27
  • @DarrylG Hi, I did see that solution; however, my goal is to identify such duplicates and drop them. The code I shared does the job, just that it's very time consuming and impractical for a big dataset. Hoping for some inputs. – Shrumo Jul 22 '21 at 12:29
  • If possible you should place the data as text rather than an image. This way others can try processing it. Even better you could provide a link to an online data file. – DarrylG Jul 22 '21 at 12:29
  • @DarrylG Oh, Let me try to post a link to the data. Thanks for the suggestion ! – Shrumo Jul 22 '21 at 12:32
  • @DarrylG I have pasted the link to the data, for your reference – Shrumo Jul 22 '21 at 12:38

2 Answers


The approach is based upon RapidFuzz, following an answer in [Vectorizing or Speeding up Fuzzywuzzy String Matching on PANDAS Column](https://stackoverflow.com/questions/52631291/vectorizing-or-speeding-up-fuzzywuzzy-string-matching-on-pandas-column).

Result

  • OP FuzzyWuzzy method: 2565.7 seconds
  • RapidFuzz method: 649.5 seconds

Thus: 4X improvement

Rapid Fuzz Implementation

import pandas as pd
import numpy as np
import openpyxl
from rapidfuzz.fuzz import token_set_ratio as rapid_token_set_ratio
from rapidfuzz import process as process_rapid
from rapidfuzz import utils as rapid_utils
import time

def excel_sheet_to_dataframe(path):
    '''
        Loads sheet from Excel workbook using openpyxl
    '''
    wb = openpyxl.load_workbook(path)
    ws = wb.active
    data = ws.values
     # Get the first line in file as a header line
    columns = next(data)[0:]
    
    return pd.DataFrame(data, columns=columns)

def process_rapid_fuzz(data):
    '''
        Process using rapid fuzz rather than fuzz_wuzzy
    '''
    series = (rapid_utils.default_process(d) for d in data)       # Pre-process to make lower-case and remove non-alphanumeric 
                                                                   # characters (generator)
    processed_data = pd.Series(series)   

    clean_rapid = []
    threshold = 80 
    for query in processed_data:
        scores = process_rapid.extract(query, processed_data, scorer=rapid_token_set_ratio, score_cutoff=threshold)
        
        m = max(scores[:2], key = lambda k:len(k[0]))                # Of up to two matches above threshold, takes longest
        clean_rapid.append(m[-1])                                    # Saving the match index
        
    clean_rapid = set(clean_rapid)                                   # remove duplicate indexes

    return data[clean_rapid]                                         # Get actual values by indexing to Pandas Series

################ Testing
t0 = time.time()
df = excel_sheet_to_dataframe('Duplicates1.xlsx')   # Using Excel file in working folder

# Desired data in body column
data = df['Body'].dropna()                                           # Dropping None rows (few None rows at end after Excel import)

result_fuzzy_rapid = process_rapid_fuzz(data)
print(f'Elapsed time {time.time() - t0}')

Version of Posted Code Used for Comparison

import pandas as pd
import numpy as np
from fuzzywuzzy import fuzz, process
import openpyxl
import time

def excel_sheet_to_dataframe(path):
    '''
        Loads sheet from Excel workbook using openpyxl
    '''
    wb = openpyxl.load_workbook(path)
    ws = wb.active
    data = ws.values
     # Get the first line in file as a header line
    columns = next(data)[0:]
    
    return pd.DataFrame(data, columns=columns)

def process_fuzzy_wuzzy(data):
    clean = []
    threshold = 80 
   
    for idx, query in enumerate(data):
        # score each sentence against each other
        # [('string', score),..]
        scores = process.extract(query, data, scorer=fuzz.token_set_ratio)
        # basic idea is if there is a close second match we want to evaluate 
        # and keep the longer of the two
        if len(scores) > 1 and scores[1][1] > threshold:    # If second one is close
            m = max(scores[:2], key=lambda k:len(k[0]))
            clean.append(m[-1])
        else:
            clean.append(idx)

    # remove duplicates
    clean = set(clean)
    return data[clean]                                        # Get actual values by indexing to Pandas Series

################ Testing
t0 = time.time()
# Get DataFrame for sheet from Excel
df = excel_sheet_to_dataframe('Duplicates1.xlsx')  

# Will Process data in 'body' column of DataFrame
data = df['Body'].dropna()                                    # Dropping None rows (few None rows at end after Excel import)

# Process Data (Pandas Series)
result_fuzzy_wuzzy = process_fuzzy_wuzzy(data)
print(f'Elapsed time {time.time() - t0}')
DarrylG
  • @DarrylG Thanks a lot! I'll just try this code out. I am assuming the second code is a modification of my code, so either of these will work? Sorry, I'm still new to this and there are some new functions I see in this code, just wanted to confirm – Shrumo Jul 23 '21 at 05:06
  • @Shrumo--it's basically your code but slightly modified, with the main difference being to use Excel files. – DarrylG Jul 23 '21 at 05:09
  • @Shrumo--used Excel rather than the Google Sheet version since using the API is more involved. – DarrylG Jul 23 '21 at 05:11
  • Okay, got it ! I'll keep you posted on how it performs. Really appreciate the help ! – Shrumo Jul 23 '21 at 05:14
  • @Shrumo--just noticed my runtimes are much better than yours using your original code. I'm using an old computer (~7 years old) so it shouldn't be that. Perhaps my mod of using indexes rather than actual data sped up your original code also. – DarrylG Jul 23 '21 at 05:19
  • @DarrylG oh, that's surprising. I'm sure it's not the system thing, as mine isn't that old. Can you elaborate on the indexes part? I'm trying to understand all aspects of this. Any info would be useful – Shrumo Jul 23 '21 at 05:24
  • @Shrumo--rather than store the strings in the array (which are pretty large), I just store the index of the matches. Then at the end I retrieve the output strings by using the indexes to retrieve the items from the Pandas data series. For instance, with indexes [0, 1, 5] I would take the strings at index 0, 1, and 5 in the original Pandas series (i.e. data is the original Pandas series); see the small sketch after these comments. – DarrylG Jul 23 '21 at 05:29
  • @DarrylG Oh, okay... got it. Also, I was just trying out the second code (which was a modification of my original code), and I was just wondering, is there any reason to add these lines around the for loop - `def process_fuzzy_wuzzy(data):
    clean = []
    return data[clean]`
    – Shrumo Jul 23 '21 at 05:40
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/235205/discussion-between-shrumo-and-darrylg). – Shrumo Jul 23 '21 at 07:20
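A minimal sketch of the index approach described in the comments above (the Series values here are made up; only the indexing pattern matters):

import pandas as pd

# Stand-in for the original 'Body' column as a pandas Series
data = pd.Series(['sentence a', 'sentence b', 'sentence c',
                  'sentence d', 'sentence e', 'sentence f'])

# Instead of appending the (potentially long) matched strings,
# only the integer index of each kept match is stored...
clean = [0, 1, 5, 1]

# ...duplicates are removed cheaply on the small index set...
clean = set(clean)

# ...and the actual strings are looked up only once, at the end.
result = data[list(clean)]
print(result)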

This answers the second part of your question. processed_data holds preprocessed strings, so the query is already preprocessed. The preprocessing is done by process.extract by default; DarrylG moved this preprocessing in front of the loop so that strings are not preprocessed multiple times. If you do not mind the strings being preprocessed again inside process.extract, you can directly iterate over the original data instead: change

series = (rapid_utils.default_process(d) for d in data)
processed_data = pd.Series(series)   

for query in processed_data:

to

for query in data:
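To see why the full stops disappear in the first place, you can call the pre-processing function directly; a quick illustration (the example string is made up):

from rapidfuzz import utils as rapid_utils

# default_process lower-cases the string, strips out non-alphanumeric
# characters (full stops, commas, ...) and trims whitespace, which is
# why the punctuation is missing from the strings you append.
print(rapid_utils.default_process("This order was cancelled. Please refund."))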

In case you want the original behavior but want the unprocessed string in the results, you can use the index of the result string to extract the unprocessed string.

def process_rapid_fuzz(data):
    '''
        Process using rapid fuzz rather than fuzz_wuzzy
    '''
    series = (rapid_utils.default_process(d) for d in data)
    processed_data = pd.Series(series)   

    for query in processed_data:
        scores = process_rapid.extract(query, processed_data,
            scorer=rapid_token_set_ratio,
            score_cutoff=threshold,
            limit=2)
        m = max(scores[:2], key = lambda k:len(k[0]))
        clean_rapid.append(data[m[2]])

There are a couple of further improvements which are possible in the implementation:

  1. You can make sure the current query will not be matched by replacing it with None in the processed_data and then use process.extractOne to find the next best match above the threshold. This is at least as fast as process.extract and might be significantly faster.
  2. You are comparing each element of processed_data to each element of processed_data. This means you are always performing both the comparison data[n] <-> data[m] and data[m] <-> data[n], even though they are guaranteed to have the same result. Only performing the comparison once should save around 50% of the runtime. The version below applies both of these changes:
def process_rapid_fuzz(data):
    '''
        Process using rapid fuzz rather than fuzz_wuzzy
    '''
    series = (rapid_utils.default_process(d) for d in data)
    processed_data = pd.Series(series)   

    for idx, query in enumerate(processed_data):
        # None is skipped by process.extract/extractOne, so it will never be part of the results
        processed_data[idx] = None
        match = process_rapid.extractOne(query, processed_data,
            scorer=rapid_token_set_ratio,
            score_cutoff=threshold)
        # compare the length using the original strings
        # alternatively len(match[0]) > len(query)
        # if you do want to compare the length of the processed version
        if match and len(data[match[2]]) > len(data[idx]):
            clean_rapid.append(data[match[2]])
        else:
            clean_rapid.append(data[idx])
maxbachmann
  • Sorry for this question, in advance. In this line that you suggested - 'clean_rapid.append(data[m[2]])'. Is this for the 'if' part of the code or the 'else'? – Shrumo Jul 29 '21 at 11:56
  • You do not need the if + else in this case, since the query is part of the results from process.extract in this version -> you have either one or two results. In both of these cases max works. – maxbachmann Jul 29 '21 at 12:23
  • oh, okay... so when I ran the code on one of my datasets, I got an error which said - "Type error: Sentence must be a string". My assumption is there might be some row which does not have a string value. But suppose I want to identify which row is causing this issue, how can I do that? Because when I search using a filter in Excel, I can't find any value which is a non-string – Shrumo Jul 29 '21 at 13:12
  • Did you remove empty rows from the input? – maxbachmann Jul 29 '21 at 13:20
  • Yes, I did.. I mean, I first removed all the blanks from the data, and then run this code – Shrumo Jul 29 '21 at 13:29
  • strange. Maybe catch the exception and print the object + index it fails on (see the sketch at the end of these comments) – maxbachmann Jul 29 '21 at 13:42
  • Yea, That's where I'm struggling. I can't seem to locate which row is creating this issue. – Shrumo Jul 29 '21 at 13:45
  • Also, are there any caveats to rapidfuzz? I mean, suppose a string starts with '//' or '[]' or any such characters, does it not consider it a string and maybe skips it in the output file? – Shrumo Jul 29 '21 at 13:46
  • It only ignores elements which are None. Currently only strings are supported, but this will change in the next release: https://github.com/maxbachmann/RapidFuzz/issues/100 – maxbachmann Jul 29 '21 at 13:50
  • Which are None, as in empty cells? – Shrumo Jul 29 '21 at 13:59
  • the None type in Python. E.g. numpy.nan will throw a TypeError, since I can not detect it without having numpy as dependency (will change when I need a Numpy dependency in the next release anyways) – maxbachmann Jul 29 '21 at 14:02
  • Oh, got it ! Thanks for your help ! – Shrumo Jul 29 '21 at 14:22
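Following up on the debugging suggestion in the comments, here is a small sketch (not from this thread; the path is a placeholder, as in the question) of one way to locate the rows that trigger the "Sentence must be a string" error before running the matching:

import pandas as pd

df = pd.read_excel(r'path')          # same input file as in the question
data = df['Body']

# Rows whose value is not a plain Python string (e.g. NaN or numbers
# left over from the Excel import) are the ones that raise the TypeError;
# print them together with their index to find them in the sheet.
bad_rows = data[~data.apply(lambda x: isinstance(x, str))]
print(bad_rows)

# Dropping them (or casting everything to str) before matching avoids the error.
data = data[data.apply(lambda x: isinstance(x, str))]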