
I have a dataset of 200k rows with two columns: (1) a unique customer-id and address combination and (2) revenue. The table is sorted by revenue. The goal is to clean up column 1 by fuzzy matching it against itself: if a customer-address combination is close enough to another combination with higher revenue, the lower-revenue value (which most likely resulted from spelling differences) should be replaced by the higher-revenue one.

Example (input and output tables not shown): the third row is very similar to the first row, so I want it to take the value of the first row.

I have working Python code, but it is too slow:

import pandas as pd
import time
from pyxdameraulevenshtein import normalized_damerau_levenshtein_distance

data = pd.read_csv("CustomerMaster.csv", encoding="ISO-8859-1")

# Create the lookup column from the dataframe itself:
lookup_data = data['UNIQUE_ID'].to_frame()

# Iterate row by row over the lookup data, find the first close enough
# fuzzy match, and write it back into the dataframe:
start = time.time()
for index, row in data.iterrows():
    if index % 5000 == 0:
        print(index, time.time() - start)
    for index2, row2 in lookup_data.iterrows():
        ratio_val = normalized_damerau_levenshtein_distance(row['UNIQUE_ID'], row2['UNIQUE_ID'])
        if ratio_val < 0.15:
            data.at[index, 'UPDATED_ID'] = row2['UNIQUE_ID']  # .at replaces the deprecated set_value()
            data.at[index, 'Ratio_Val'] = ratio_val
            break

Currently this fuzzy-matching block takes far too long to run: about 8 hours for the first 15k rows, with the runtime growing quadratically as one would expect from the nested loops. Any suggestions on how to write this code more efficiently?

D.S

2 Answers


One immediate suggestion: since matching is symmetric, each row only needs to be matched against the rows that have not been visited yet. Rewrite the inner loop to skip over the previously visited rows, e.g. add this at the top of the inner loop:

if index2 <= index:
  continue

This alone will speed up the matching by a factor of 2.
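
For reference, here is a sketch of how the two loops from the question would look with that check in place (everything else kept as in the question; data.at is used instead of the deprecated set_value):

for index, row in data.iterrows():
    for index2, row2 in lookup_data.iterrows():
        if index2 <= index:  # skip pairs that have already been compared
            continue
        ratio_val = normalized_damerau_levenshtein_distance(row['UNIQUE_ID'], row2['UNIQUE_ID'])
        if ratio_val < 0.15:
            data.at[index, 'UPDATED_ID'] = row2['UNIQUE_ID']
            data.at[index, 'Ratio_Val'] = ratio_val
            break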

DYZ

I had the same issue and resolved it with a combination of the Levenshtein package (to create a distance matrix) and scikit-learn's DBSCAN to cluster similar strings and to assign the same value to every element within a cluster (a plain Levenshtein-plus-DBSCAN sketch follows the example below).

You can check it out here: https://github.com/ebravofm/e_utils (homog_lev())

>>> import pandas as pd
>>> from e_utils.utils import clean_df
>>> from e_utils.utils import homog_lev

>>> series
0        Bad Bunny
1         bad buny
2      bag bunny
3            Ozuna
4     De La Ghetto
5      de la geto
6     Daddy Yankee
7      dade yankee
8        Nicky Jam
9        nicky jam
10        J Balvin
11        jbalvin
12          Maluma
13          maluma
14        Anuel AA

>>> series2 = clean_df(series)
>>> series2 = homog_lev(series2, eps=3)
>>> pd.concat([series, series2.str.title()], axis=1, keys=['*Original*', '*Fixed*'])

      *Original*       *Fixed*
0      Bad Bunny     Bad Bunny
1       bad buny     Bad Bunny
2    bag bunny       Bad Bunny
3          Ozuna         Ozuna
4   De La Ghetto  De La Ghetto
5    de la geto   De La Ghetto
6   Daddy Yankee  Daddy Yankee
7    dade yankee  Daddy Yankee
8      Nicky Jam     Nicky Jam
9      nicky jam     Nicky Jam
10      J Balvin      J Balvin
11      jbalvin       J Balvin
12        Maluma        Maluma
13        maluma        Maluma
14      Anuel AA      Anuel Aa
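
If you prefer not to depend on e_utils, here is a minimal sketch of the same idea using only the python-Levenshtein and scikit-learn packages. The series is a small made-up sample, and picking the first occurrence as the canonical value is just a placeholder; for the question's data you would pick the highest-revenue row of each cluster instead.

import numpy as np
import pandas as pd
import Levenshtein                      # pip install python-Levenshtein
from sklearn.cluster import DBSCAN

names = pd.Series(['Bad Bunny', 'bad buny', 'bag bunny', 'Ozuna'])  # made-up sample

# Pairwise edit-distance matrix. This is still O(n^2) comparisons, so for
# 200k rows you would block the data first (e.g. group by the first few
# characters) and cluster within each block.
n = len(names)
dist = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        d = Levenshtein.distance(names.iloc[i].lower(), names.iloc[j].lower())
        dist[i, j] = dist[j, i] = d

# Cluster strings whose edit distance is at most eps; min_samples=1 keeps
# unmatched strings as their own one-element cluster instead of noise.
labels = DBSCAN(eps=3, min_samples=1, metric='precomputed').fit(dist).labels_

# Replace every member of a cluster with one canonical value -- here simply
# the first occurrence within the cluster.
fixed = names.groupby(labels).transform('first')
print(pd.concat([names, fixed], axis=1, keys=['original', 'fixed']))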
ebravo