I have a loop that compares street addresses. It uses fuzzy matching to tokenise the addresses, compare them, and return a score for how close the match is. I have tried this with both fuzzywuzzy and rapidfuzz.
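For context, this is the kind of score I mean (a quick sanity check with rapidfuzz; the addresses here are made up):

from rapidfuzz import fuzz

# The same tokens in a different order score 100 after token sorting
print(fuzz.token_sort_ratio("12 Main Street, Springfield",
                            "Main Street 12, Springfield"))  # 100.0
# An abbreviated variant scores lower
print(fuzz.token_sort_ratio("12 Main Street, Springfield",
                            "12 Main St, Springfield"))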
The aim is to take all of my street addresses (30k or so) and match each variation of a street address to a structured street address in my dataset.
The end result would be a reference table with two columns:
- Column A is the reference address
- Column B is an address where the match is good enough to be associated with Column A
- Column A can have many associated addresses
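Something like this (the addresses and groupings are made up, just to show the shape):

reference_address              matched_address
12 Main Street, Springfield    12 Main St, Springfield
12 Main Street, Springfield    Main Street 12, Springfield
7 Oak Avenue, Shelbyville      7 Oak Ave, Shelbyville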
I am not a huge Python user, but I do know that for loops are the last resort for most problems (third answer). With that in mind, I have still used for loops, and they will take approximately 235 hours to run, which is sub-optimal to say the least. I have created a reproducible example below. Can anyone see where I can make any tweaks? I have added a progress bar to give you an idea of the speed. You can increase the number of addresses by changing the line `for _ in range(20):`.
import pandas as pd
from tqdm import tqdm
from faker import Faker
from rapidfuzz import process, fuzz
# GENERATE FAKE ADDRESSES FOR THE REPRODUCIBLE EXAMPLE -----------------------------------------------
fake = Faker()
fake_addresses = pd.DataFrame()
for _ in range(20):
    # Generate a fake address
    d = {'add': fake.address()}
    df = pd.DataFrame(data=[d])
    # Append it to the addresses dataframe
    fake_addresses = pd.concat([fake_addresses, df])

fake_addresses = fake_addresses.reset_index(drop=True)
# COMPARE ADDRESSES ---------------------------------------------------------------------------------
# Here we are making a "dictionary" of the addresses: the left side is the reference address
# and the right side holds all the different variations of that address. The addresses only
# have to be 0% similar here, so every pair is kept; normally the threshold is 95% similarity
reference = fake_addresses['add'].drop_duplicates()
ref_addresses = pd.DataFrame()
# This takes a long time. I have added tqdm to show how long it takes when the number of
# addresses is increased dramatically
for address in tqdm(reference):
    for raw_address in reference:
        result = fuzz.token_sort_ratio(address, raw_address)
        d = {'reference_address': address,
             'matched_address': raw_address,
             'matched_result': result}
        df = pd.DataFrame(data=[d])
        # Keep only pairs at or above the threshold (0 here, so everything passes)
        filt = df['matched_result'] >= 0
        df = df.loc[filt]
        ref_addresses = pd.concat([ref_addresses, df], ignore_index=True)
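For what it's worth, I import process from rapidfuzz but never use it. From the docs I gather that process.cdist can score every pair in a single call across multiple cores, which I assume is the kind of vectorised approach people would point me towards. Here is a minimal sketch of what I mean, continuing from the example above (the 95 cutoff is just the threshold I mentioned earlier):

import numpy as np

addresses = reference.tolist()

# Score every address against every other address in one call,
# using all cores instead of a Python-level double loop
scores = process.cdist(addresses, addresses,
                       scorer=fuzz.token_sort_ratio,
                       workers=-1)

# Keep the pairs at or above the similarity threshold
rows, cols = np.where(scores >= 95)
ref_addresses = pd.DataFrame({
    'reference_address': [addresses[i] for i in rows],
    'matched_address': [addresses[j] for j in cols],
    'matched_result': scores[rows, cols],
})

Is that roughly the right direction, or is there a better way to restructure the loop itself?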