I am trying to join two datasets by fuzzy matching with the fuzzywuzzy package. The function is based on the one from this question:

"Is it possible to do fuzzy match merge with python pandas?"

Here is the code that I have:

import pandas as pd
from fuzzywuzzy import fuzz
from fuzzywuzzy import process

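# pd.read_csv already returns a DataFrame, so no extra conversion is needed
blanks = pd.read_csv("names_blank_type.csv")
mapping = pd.read_csv("TYPE-MAP.csv")

# Drop the columns that are not needed for matching
blanks = blanks.drop(blanks.columns[[0, 1]], axis=1)
mapping = mapping.drop(mapping.columns[[2]], axis=1)

# Small samples for testing; .copy() so that adding a column later
# does not raise SettingWithCopyWarning
blanks_smaller = blanks.head(100).copy()
mapping_smaller = mapping.head(100).copy()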

def fuzzy_merge(df_1, df_2, key1, key2, threshold=90, limit=2):
    # Candidate strings to match against
    s = df_2[key2].tolist()

    # For each value in df_1[key1], get up to `limit` (match, score) pairs
    m = df_1[key1].apply(lambda x: process.extract(x, s, limit=limit))
    df_1['matches'] = m

    # Keep only the matches scoring at or above the threshold, joined into one string
    m2 = df_1['matches'].apply(lambda x: ', '.join([i[0] for i in x if i[1] >= threshold]))
    df_1['matches'] = m2

    return df_1

fuzzy_merge(mapping_smaller, blanks_smaller, 'company_name', 'company_name', threshold=80)

The fuzzy_merge function works when I use only mapping_smaller and blanks_smaller, but when I run it on the full datasets I get the following error:

[Screenshot of the error traceback; image not available]

The full blanks dataset has just under 350,000 rows and the full mapping dataset has just under 100,000 rows. Any thoughts on why it works with the small samples but fails with the above error on the full data?

Thanks!!

matt_lnrd

1 Answer

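Without being able to see the dataset, I would assume that you're working with some less-than-clean data. Did you check for N/A or mistyped values (e.g. booleans or ints) in the company_name columns? The first 100 rows may simply not contain any such values, which would explain why the small samples work and the full datasets do not.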
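
Here is a minimal sketch of such a check, assuming the key column is company_name as in the question. The helper names non_string_rows and clean_key and the toy data are made up for illustration; they are not part of pandas or fuzzywuzzy, and I can't confirm this is the cause without seeing the actual error.

```python
import pandas as pd


def non_string_rows(df, key):
    """Return the rows of df whose value in column `key` is not a str (NaN/None included)."""
    mask = df[key].map(lambda v: not isinstance(v, str))
    return df[mask]


def clean_key(df, key):
    """Return a copy of df with missing keys dropped and the key column cast to stripped strings."""
    out = df.dropna(subset=[key]).copy()
    out[key] = out[key].astype(str).str.strip()
    return out


# Toy data standing in for the real CSVs (hypothetical values)
sample = pd.DataFrame({"company_name": ["Acme Corp", None, 42, True, " Globex "]})

# Rows that a string matcher would likely choke on
bad = non_string_rows(sample, "company_name")
print(bad)

# Cleaned copy that is safe to pass to a string-matching function
cleaned = clean_key(sample, "company_name")
print(cleaned)
```

If non_string_rows returns anything for the full blanks or mapping frames, try running fuzzy_merge on clean_key(mapping, 'company_name') and clean_key(blanks, 'company_name') instead of the raw frames.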

YoungTim