I have a dataframe `df_sample` with 10 parsed addresses, and I am comparing it to another dataframe `df` with hundreds of thousands of parsed address records. Both `df_sample` and `df` share the exact same structure:
```
zip_code  city       state    street_number  street_name  unit_number  country
12345     FAKEVILLE  FLORIDA  123            FAKE ST      NaN          US
```
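For reference, a hypothetical construction of one row in this shape (the values are fake; `df` has the same columns, just far more rows):

```python
import pandas as pd

columns = ['zip_code', 'city', 'state', 'street_number',
           'street_name', 'unit_number', 'country']

# One fake record in the shared structure.
df_sample = pd.DataFrame(
    [['12345', 'FAKEVILLE', 'FLORIDA', '123', 'FAKE ST', None, 'US']],
    columns=columns,
)
```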
What I want to do is match a single row in `df_sample` against every row in `df`, starting with `state`, and take only the rows where the fuzzy-match score between the two `state` values is above 90 into a new dataframe (I'm using fuzzywuzzy, whose `fuzz.ratio` returns an integer from 0 to 100). Once this new, smaller dataframe is created from those matches, I would continue to do this for `city`, `zip_code`, etc. Something like:

```python
df_match = df[fuzz.ratio(df_sample['state'], df['state']) > 90]
```

except that doesn't work, since `fuzz.ratio` compares two strings rather than whole Series.
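Something row-wise is probably closer to what I mean. A minimal sketch for a single column, assuming the fuzzywuzzy package and, arbitrarily, the first row of `df_sample` as the record to match:

```python
from fuzzywuzzy import fuzz

# My assumption: match the first sample record; any row of df_sample would do.
record = df_sample.iloc[0]

# Score every row of df against the sample's state, one string pair at a time.
scores = df['state'].apply(lambda s: fuzz.ratio(str(s), str(record['state'])))

# Keep only the close matches (fuzz.ratio returns 0-100).
df_match = df[scores > 90]
```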
My goal is to narrow down the number of matches each time I apply a stricter criterion, eventually ending up with a dataframe containing as few candidate matches as possible, narrowed down column by column. But I am unsure how to do this for even a single record.
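To make the intent concrete, here is roughly how I picture the cascade. This is only a sketch, again assuming fuzzywuzzy, with a per-column threshold of 90 and a column order I chose arbitrarily:

```python
from fuzzywuzzy import fuzz

def narrow_matches(record, candidates, columns, threshold=90):
    """Successively shrink `candidates` by fuzzy-matching each column
    of `record`, keeping only rows that score above `threshold`."""
    result = candidates
    for col in columns:
        scores = result[col].apply(lambda s: fuzz.ratio(str(s), str(record[col])))
        result = result[scores > threshold]
        if result.empty:
            break  # nothing left to narrow; stop early
    return result

# Broadest column first, then progressively more specific fields.
matches = narrow_matches(df_sample.iloc[0], df, ['state', 'city', 'zip_code'])
```

Is a loop like this a reasonable approach, or is there a more idiomatic or vectorized way to do this in pandas?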