
So I am looking to modify this code to reduce the runtime of the fuzzywuzzy matching. At present, it takes about an hour for a dataset with 800 rows, and when I used it on a dataset with 4.5K rows it kept running for almost 6 hours with still no result, so I had to stop the kernel.

I need to use this code on a dataset of at least 20K rows. Can anyone please suggest any edits to this code to get the results faster? This is the code -

import pandas as pd
import numpy as np
from fuzzywuzzy import fuzz,process

df = pd.read_csv(r'path')
df.head()

data = df['Body']
print(data)

clean = []
threshold = 80 
for row in data:
  # score each sentence against each other
  # [('string', score),..]
  scores = process.extract(row, data, scorer=fuzz.token_set_ratio)
  # basic idea is if there is a close second match we want to evaluate 
  # and keep the longer of the two
  if scores[1][1] > threshold:
     clean.append(max([x[0] for x in scores[:2]],key=len))
  else:
     clean.append(scores[0][0])

# remove dupes
clean = set(clean)

#converting 'clean' list to dataframe and giving the column name for the cleaned column
clean_data = pd.DataFrame(clean, columns=['Body'])

clean_data.to_csv(r'path') 

This is what my data looks like -

https://docs.google.com/spreadsheets/d/1p9RC9HznhdJFH4kFYdE_TgnHdoRf8P6gTEAkB3lQWEE/edit?usp=sharing

So if you notice, rows 14 & 15 and rows 19 & 20 are partial duplicates; I want the code to identify such sentences and drop the shorter ones.
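For anyone unfamiliar with the scorer, here is a minimal illustration (made-up sentences, not from my data) of why fuzz.token_set_ratio gives such partial duplicates a very high score, which is what the threshold of 80 in the loop above is meant to catch:

from fuzzywuzzy import fuzz

# One sentence's tokens are a subset of the other's, so the
# token-set comparison treats them as a (near-)perfect match.
short_sentence = "The delivery was delayed by two days"
long_sentence = "The delivery was delayed by two days due to bad weather"

print(fuzz.token_set_ratio(short_sentence, long_sentence))   # 100 for a full token subset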

Update -

I made a minor change to the rapidfuzz solution given by @DarrylG, and now the code looks like this -

import pandas as pd
import numpy as np
import openpyxl
from rapidfuzz.fuzz import token_set_ratio as rapid_token_set_ratio
from rapidfuzz import process as process_rapid
from rapidfuzz import utils as rapid_utils
import time

df = pd.read_excel(r'path')

data = df['Body']
print(data)

def excel_sheet_to_dataframe(path):
    '''
        Loads sheet from Excel workbook using openpyxl
    '''
    wb = openpyxl.load_workbook(path)
    ws = wb.active
    data = ws.values
     # Get the first line in file as a header line
    columns = next(data)[0:]
    
    return pd.DataFrame(data, columns=columns)


clean_rapid = []
threshold = 80 

def process_rapid_fuzz(data):
    '''
        Process using rapid fuzz rather than fuzz_wuzzy
    '''
    series = (rapid_utils.default_process(d) for d in data)       # Pre-process to make lower-case and remove non-alphanumeric 
                                                                   # characters (generator)
    processed_data = pd.Series(series)   

    for query in processed_data:
        scores = process_rapid.extract(query, processed_data, scorer=rapid_token_set_ratio, score_cutoff=threshold)
        if len(scores) > 1 and scores[1][1] > threshold:
            m = max(scores[:2], key=lambda k: len(k[0]))                 # Of up to two matches above threshold, take the longest
            clean_rapid.append(m[0])                                     # Saving the matched (pre-processed) string
        else:
            clean_rapid.append(query)

################ Testing
t0 = time.time()
df = excel_sheet_to_dataframe(r'path')   # Using Excel file in working folder

# Desired data in body column
data = df['Body'].dropna()                                           # Dropping None rows (few None rows at end after Excel import)

result_fuzzy_rapid = process_rapid_fuzz(data)
print(f'Elapsed time {time.time() - t0}')

# remove dupes
clean_rapid = set(clean_rapid)

#converting 'clean' list to dataframe and giving the column name for the cleaned column
clean_data = pd.DataFrame(clean_rapid, columns=['Body'])

#exporting the cleaned data
clean_data.to_excel(r'path')

Now the issue is that in the output file all the full stops, etc. are getting dropped. How can I retain them?

Shrumo
  • Can you provide a small extract from your CSV file? – AndyKnight Jul 22 '21 at 11:15
  • Check out the answer by maxbachmann in [Vectorizing or Speeding up Fuzzywuzzy String Matching on PANDAS Column](https://stackoverflow.com/questions/52631291/vectorizing-or-speeding-up-fuzzywuzzy-string-matching-on-pandas-column) which produced 10X improvement for a similar problem. – DarrylG Jul 22 '21 at 11:31
  • @AndyKnight Hi, I have added a snippet of what my data looks like. Hope it helps – Shrumo Jul 22 '21 at 12:25
  • @Shrumo--isn't the reference just using a single column of one data frame (i.e. column 'org_name'). Doesn't it use fuzzy_wuzzy to find the closest matches of each row in the entire column? This seems similar to what you're doing. – DarrylG Jul 22 '21 at 12:27
  • @DarrylG Hi, I did see that solution; however, my goal is to identify such duplicates and drop them. The code I shared does the job, just that it's very time consuming and impractical for a big dataset. Hoping for some inputs. – Shrumo Jul 22 '21 at 12:29
  • If possible you should place the data as text rather than an image. This way others can try processing it. Even better you could provide a link to an online data file. – DarrylG Jul 22 '21 at 12:29
  • @DarrylG Oh, Let me try to post a link to the data. Thanks for the suggestion ! – Shrumo Jul 22 '21 at 12:32
  • @DarrylG I have pasted the link to the data, for your reference – Shrumo Jul 22 '21 at 12:38

2 Answers


The approach is based upon RapidFuzz, following an answer in [Vectorizing or Speeding up Fuzzywuzzy String Matching on PANDAS Column](https://stackoverflow.com/questions/52631291/vectorizing-or-speeding-up-fuzzywuzzy-string-matching-on-pandas-column).

Result

  • OP FuzzyWuzzy method: 2565.7 seconds
  • RapidFuzz method: 649.5 seconds

Thus: 4X improvement

Rapid Fuzz Implementation

import pandas as pd
import numpy as np
import openpyxl
from rapidfuzz.fuzz import token_set_ratio as rapid_token_set_ratio
from rapidfuzz import process as process_rapid
from rapidfuzz import utils as rapid_utils
import time

def excel_sheet_to_dataframe(path):
    '''
        Loads sheet from Excel workbook using openpyxl
    '''
    wb = openpyxl.load_workbook(path)
    ws = wb.active
    data = ws.values
     # Get the first line in file as a header line
    columns = next(data)[0:]
    
    return pd.DataFrame(data, columns=columns)

def process_rapid_fuzz(data):
    '''
        Process using rapid fuzz rather than fuzz_wuzzy
    '''
    series = (rapid_utils.default_process(d) for d in data)       # Pre-process to make lower-case and remove non-alphanumeric 
                                                                   # characters (generator)
    processed_data = pd.Series(series)   

    clean_rapid = []
    threshold = 80 
    for query in processed_data:
        scores = process_rapid.extract(query, processed_data, scorer=rapid_token_set_ratio, score_cutoff=threshold)
        
        m = max(scores[:2], key = lambda k:len(k[0]))                # Of up to two matches above threshold, takes longest
        clean_rapid.append(m[-1])                                    # Saving the match index
        
    clean_rapid = set(clean_rapid)                                   # remove duplicate indexes

    return data[clean_rapid]                                         # Get actual values by indexing to Pandas Series

################ Testing
t0 = time.time()
df = excel_sheet_to_dataframe('Duplicates1.xlsx')   # Using Excel file in working folder

# Desired data in body column
data = df['Body'].dropna()                                           # Dropping None rows (few None rows at end after Excel import)

result_fuzzy_rapid = process_rapid_fuzz(data)
print(f'Elapsed time {time.time() - t0}')

Version of Posted Code Used for Comparison

import pandas as pd
import numpy as np
from fuzzywuzzy import fuzz, process
import openpyxl
import time

def excel_sheet_to_dataframe(path):
    '''
        Loads sheet from Excel workbook using openpyxl
    '''
    wb = openpyxl.load_workbook(path)
    ws = wb.active
    data = ws.values
     # Get the first line in file as a header line
    columns = next(data)[0:]
    
    return pd.DataFrame(data, columns=columns)

def process_fuzzy_wuzzy(data):
    clean = []
    threshold = 80 
   
    for idx, query in enumerate(data):
        # score each sentence against each other
        # [('string', score),..]
        scores = process.extract(query, data, scorer=fuzz.token_set_ratio)
        # basic idea is if there is a close second match we want to evaluate 
        # and keep the longer of the two
        if len(scores) > 1 and scores[1][1] > threshold:    # If second one is close
            m = max(scores[:2], key=lambda k:len(k[0]))
            clean.append(m[-1])
        else:
            clean.append(idx)

    # remove duplicates
    clean = set(clean)
    return data[clean]                                        # Get actual values by indexing to Pandas Series

################ Testing
t0 = time.time()
# Get DataFrame for sheet from Excel
df = excel_sheet_to_dataframe('Duplicates1.xlsx')  

# Will Process data in 'body' column of DataFrame
data = df['Body'].dropna()                                    # Dropping None rows (few None rows at end after Excel import)

# Process Data (Pandas Series)
result_fuzzy_wuzzy = process_fuzzy_wuzzy(data)
print(f'Elapsed time {time.time() - t0}')
DarrylG
  • @DarrylG Thanks a lot! I'll just try this code out. I am assuming the second code is a modification of my code, so either of these will work? Sorry, I'm still new to this and there are some new functions I see in this code, just wanted to confirm – Shrumo Jul 23 '21 at 05:06
  • @Shrumo--it's basically your code but slightly modified, with the main difference being to use Excel files. – DarrylG Jul 23 '21 at 05:09
  • @Shrumo--used Excel rather than the Google Sheet version since using the API is more involved. – DarrylG Jul 23 '21 at 05:11
  • Okay, got it ! I'll keep you posted on how it performs. Really appreciate the help ! – Shrumo Jul 23 '21 at 05:14
  • @Shrumo--just noticed my runtimes are much better than yours using your original code. I'm using an old computer (~7 years old) so it shouldn't be that. Perhaps my mod of using indexes rather than actual data sped up your original code also. – DarrylG Jul 23 '21 at 05:19
  • @DarrylG oh, that's surprising. I'm sure it's not the system thing, as mine isn't that old. Can you elaborate on the indexes part? I'm trying to understand all aspects of this. Any info would be useful – Shrumo Jul 23 '21 at 05:24
  • @Shrumo--rather than store the strings in the array (which are pretty large), I just store the index of the matches. Then at the end I retrieve the output strings by using the indexes to retrieve the items from the Pandas data series. For instance, with indexes [0, 1, 5] I would take the strings at index 0, 1, and 5 in the original Pandas series (i.e. data is the original Pandas series); see the small sketch after these comments. – DarrylG Jul 23 '21 at 05:29
  • @DarrylG Oh, okay... got it. Also, I was just trying out the second code (which was a modification of my original code), and I was just wondering, is there any reason to add these lines around the for loop - `def process_fuzzy_wuzzy(data):
    clean = []
    return data[clean]`
    – Shrumo Jul 23 '21 at 05:40
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/235205/discussion-between-shrumo-and-darrylg). – Shrumo Jul 23 '21 at 07:20
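A minimal sketch of the index approach described in the comments above (the Series values here are made up; only the indexing pattern matters):

import pandas as pd

# Stand-in for the original 'Body' column as a pandas Series
data = pd.Series(['sentence a', 'sentence b', 'sentence c',
                  'sentence d', 'sentence e', 'sentence f'])

# Instead of appending the (potentially long) matched strings,
# only the integer index of each kept match is stored...
clean = [0, 1, 5, 1]

# ...duplicates are removed cheaply on the small index set...
clean = set(clean)

# ...and the actual strings are looked up only once, at the end.
result = data[list(clean)]
print(result)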

This answers the second part of your question. processed_data holds preprocessed strings, so the query is already preprocessed. The preprocessing is done by process.extract by default; DarrylG moved this preprocessing in front of the loop so that strings are not preprocessed multiple times. If you do not mind the strings being preprocessed again inside process.extract, you can directly iterate over the original data instead: change

series = (rapid_utils.default_process(d) for d in data)
processed_data = pd.Series(series)   

for query in processed_data:

to

for query in data:
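To see why the full stops disappear in the first place, you can call the pre-processing function directly; a quick illustration (the example string is made up):

from rapidfuzz import utils as rapid_utils

# default_process lower-cases the string, strips out non-alphanumeric
# characters (full stops, commas, ...) and trims whitespace, which is
# why the punctuation is missing from the strings you append.
print(rapid_utils.default_process("This order was cancelled. Please refund."))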

In case you want the original behavior but want the unprocessed string in the results, you can use the index of the result string to extract the unprocessed string.

def process_rapid_fuzz(data):
    '''
        Process using rapid fuzz rather than fuzz_wuzzy
    '''
    series = (rapid_utils.default_process(d) for d in data)
    processed_data = pd.Series(series)   

    for query in processed_data:
        scores = process_rapid.extract(query, processed_data,
            scorer=rapid_token_set_ratio,
            score_cutoff=threshold,
            limit=2)
        m = max(scores[:2], key = lambda k:len(k[0]))
        clean_rapid.append(data[m[2]])

There are a couple of further improvements which are possible in the implementation:

  1. You can make sure the current query will not be matched by replacing it with None in the processed_data and then use process.extractOne to find the next best match above the threshold. This is at least as fast as process.extract and might be significantly faster.
  2. You are comparing each element of processed_data to each element of processed_data. This means you are always performing both the comparison data[n] <-> data[m] and data[m] <-> data[n], even though they are guaranteed to have the same result. Only performing the comparison once should save around 50% of the runtime. The version below applies both of these changes:
def process_rapid_fuzz(data):
    '''
        Process using rapid fuzz rather than fuzz_wuzzy
    '''
    series = (rapid_utils.default_process(d) for d in data)
    processed_data = pd.Series(series)   

    for idx, query in enumerate(processed_data):
        # None is skipped by process.extract/extractOne, so it will never be part of the results
        processed_data[idx] = None
        match = process_rapid.extractOne(query, processed_data,
            scorer=rapid_token_set_ratio,
            score_cutoff=threshold)
        # compare the length using the original strings
        # alternatively len(match[0]) > len(query)
        # if you do want to compare the length of the processed version
        if match and len(data[match[2]]) > len(data[idx]):
            clean_rapid.append(data[match[2]])
        else:
            clean_rapid.append(data[idx])
maxbachmann
  • Sorry for this question, in advance. In this line that you suggested - 'clean_rapid.append(data[m[2]])'. Is this for the 'if' part of the code or the 'else'? – Shrumo Jul 29 '21 at 11:56
  • You do not need the if + else in this case, since the query is part of the results from process.extract in this version -> you have either one or two results. In both of these cases max works. – maxbachmann Jul 29 '21 at 12:23
  • oh, okay... so when I ran the code on one of my datasets, I got an error which said - "Type error: Sentence must be a string". My assumption is there might be some row which does not have a string value. But suppose I want to identify which row is causing this issue, how can I do that? Because when I search using a filter in Excel, I can't find any value which is a non-string – Shrumo Jul 29 '21 at 13:12
  • Did you remove empty rows from the input? – maxbachmann Jul 29 '21 at 13:20
  • Yes, I did.. I mean, I first removed all the blanks from the data, and then run this code – Shrumo Jul 29 '21 at 13:29
  • strange. Maybe catch the exception and print the object + index it fails on (see the sketch at the end of these comments) – maxbachmann Jul 29 '21 at 13:42
  • Yea, That's where I'm struggling. I can't seem to locate which row is creating this issue. – Shrumo Jul 29 '21 at 13:45
  • Also, are there any caveats to rapidfuzz? I mean, suppose a string starts with '//' or '[]' or any such characters, does it not consider it a string and maybe skips it in the output file? – Shrumo Jul 29 '21 at 13:46
  • It only ignores elements which are None. Currently only strings are supported, but this will change in the next release: https://github.com/maxbachmann/RapidFuzz/issues/100 – maxbachmann Jul 29 '21 at 13:50
  • Which are None, as in empty cells? – Shrumo Jul 29 '21 at 13:59
  • the None type in Python. E.g. numpy.nan will throw a TypeError, since I can not detect it without having numpy as dependency (will change when I need a Numpy dependency in the next release anyways) – maxbachmann Jul 29 '21 at 14:02
  • Oh, got it ! Thanks for your help ! – Shrumo Jul 29 '21 at 14:22
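Following up on the debugging suggestion in the comments, here is a small sketch (not from this thread; the path is a placeholder, as in the question) of one way to locate the rows that trigger the "Sentence must be a string" error before running the matching:

import pandas as pd

df = pd.read_excel(r'path')          # same input file as in the question
data = df['Body']

# Rows whose value is not a plain Python string (e.g. NaN or numbers
# left over from the Excel import) are the ones that raise the TypeError;
# print them together with their index to find them in the sheet.
bad_rows = data[~data.apply(lambda x: isinstance(x, str))]
print(bad_rows)

# Dropping them (or casting everything to str) before matching avoids the error.
data = data[data.apply(lambda x: isinstance(x, str))]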