I have a dataframe `df_sample` with 10 parsed addresses, and I am comparing it to another dataframe `df` with hundreds of thousands of parsed address records. Both `df_sample` and `df` share the exact same structure:
```
zip_code  city       state    street_number  street_name  unit_number  country
12345     FAKEVILLE  FLORIDA  123            FAKE ST      NaN          US
```
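For reference, a hypothetical construction of one row in this shape (the values are fake; `df` has the same columns, just far more rows):

```python
import pandas as pd

columns = ['zip_code', 'city', 'state', 'street_number',
           'street_name', 'unit_number', 'country']

# One fake record in the shared structure.
df_sample = pd.DataFrame(
    [['12345', 'FAKEVILLE', 'FLORIDA', '123', 'FAKE ST', None, 'US']],
    columns=columns,
)
```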
What I want to do is match a single row in `df_sample` against every row in `df`, starting with `state`, and take only the rows where the fuzzy-match score between the two `state` values is above 90 into a new dataframe (I'm using fuzzywuzzy, whose `fuzz.ratio` returns an integer from 0 to 100). Once this new, smaller dataframe is created from those matches, I would continue to do this for `city`, `zip_code`, etc. Something like:

```python
df_match = df[fuzz.ratio(df_sample['state'], df['state']) > 90]
```

except that doesn't work, since `fuzz.ratio` compares two strings rather than whole Series.
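Something row-wise is probably closer to what I mean. A minimal sketch for a single column, assuming the fuzzywuzzy package and, arbitrarily, the first row of `df_sample` as the record to match:

```python
from fuzzywuzzy import fuzz

# My assumption: match the first sample record; any row of df_sample would do.
record = df_sample.iloc[0]

# Score every row of df against the sample's state, one string pair at a time.
scores = df['state'].apply(lambda s: fuzz.ratio(str(s), str(record['state'])))

# Keep only the close matches (fuzz.ratio returns 0-100).
df_match = df[scores > 90]
```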
My goal is to narrow down the number of matches each time I apply a stricter criterion, eventually ending up with a dataframe containing as few candidate matches as possible, narrowed down column by column. But I am unsure how to do this for even a single record.
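To make the intent concrete, here is roughly how I picture the cascade. This is only a sketch, again assuming fuzzywuzzy, with a per-column threshold of 90 and a column order I chose arbitrarily:

```python
from fuzzywuzzy import fuzz

def narrow_matches(record, candidates, columns, threshold=90):
    """Successively shrink `candidates` by fuzzy-matching each column
    of `record`, keeping only rows that score above `threshold`."""
    result = candidates
    for col in columns:
        scores = result[col].apply(lambda s: fuzz.ratio(str(s), str(record[col])))
        result = result[scores > threshold]
        if result.empty:
            break  # nothing left to narrow; stop early
    return result

# Broadest column first, then progressively more specific fields.
matches = narrow_matches(df_sample.iloc[0], df, ['state', 'city', 'zip_code'])
```

Is a loop like this a reasonable approach, or is there a more idiomatic or vectorized way to do this in pandas?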