
I have a loop that compares street addresses. It uses fuzzy matching to tokenise each address, compares the pair, and returns how close the match is. I have tried this with both fuzzywuzzy and rapidfuzz. The aim is to take all of my street addresses (30k or so) and match each variation of a street address to a structured street address in my dataset. The end result would be a reference table with two columns:

  • Column A is the reference address
  • Column B is an address where the match is good enough to be associated with Column A
  • Column A can have many associated addresses (illustrated below).
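For example, the finished reference table might look something like this (made-up addresses, purely illustrative):

reference_address                  matched_address
12 Main Street, Springfield        12 Main St, Springfield
12 Main Street, Springfield        12 Main Street Springfeild
45 Oak Avenue, Shelbyville         45 Oak Ave, Shelbyville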

I am not a huge Python user, but I do know that for loops are the last resort for most problems (third answer). With that in mind, I have used for loops anyway. However, my loops will take approximately 235 hours, which is sub-optimal to say the least. I have created a reproducible example below. Can anyone see where I can make any tweaks? I have added a progress bar to give you an idea of the speed. You can increase the number of addresses by changing the line `for _ in range(20):`.

import pandas as pd
from tqdm import tqdm
from faker import Faker
from rapidfuzz import process, fuzz

# GENERATE FAKE ADDRESSES FOR THE REPRODUCIBLE EXAMPLE -----------------------------------------------
fake = Faker()
fake_addresses = pd.DataFrame()

for _ in range(20):
    # Generate fake address
    d = {'add':fake.address()}
    df = pd.DataFrame(data = [d])

    # Append it to the addresses dataframe
    fake_addresses = pd.concat([fake_addresses, df])

fake_addresses = fake_addresses.reset_index(drop=True)

# COMPARE ADDRESSES ---------------------------------------------------------------------------------   
# Here we are making a "dictionary" of the addresses where the left side is a reference address
# and the right side is all the different variations of that address. For this example the addresses
# only have to be at least 0% similar; normally this would be 95% similarity
reference = fake_addresses['add'].drop_duplicates()
ref_addresses = pd.DataFrame()

# This takes a long time. I have added tqdm to show how long when the number of addresses is increased dramatically
for address in tqdm(reference):
    for raw_address in reference:
        result = fuzz.token_sort_ratio(address, raw_address)
        
        d = {'reference_address': address, 
             'matched_address': raw_address,
             'matched_result': result}
        
        df = pd.DataFrame(data = [d])
        
        if len(df.index) > 0:
            filt = df['matched_result'] >= 0
            df = df.loc[filt]
            ref_addresses = pd.concat([ref_addresses, df], ignore_index=True)
        else:
            ref_addresses = ref_addresses
John Smith

3 Answers


You could try something like this:

result = pd.DataFrame()
for pos in range(len(reference) - 1):
    # Compare each address only against the addresses that come after it.
    # score_cutoff=1 only keeps matches with a score of 1 or higher; change it to whatever value you want.
    # Note that process.extract returns at most 5 matches by default, so raise the limit argument
    # if you want every match above the cutoff.
    comparison = process.extract(reference.iloc[pos], reference.iloc[pos+1:],
                                 scorer=fuzz.token_sort_ratio, score_cutoff=1)
    result = pd.concat([result, pd.DataFrame(comparison, columns=['add', 'score', 'pos'])])
    # The pos column is the index of the match in the reference Series.

result.drop('pos', axis=1, inplace=True)
# We don't need the pos column, so we drop it.

If you have the addresses in two different columns instead, you can do something like the following.

result = pd.DataFrame()
for address in fake_addresses_df.add1:
    # Again, score_cutoff=1 only keeps matches with a score of 1 or higher, and process.extract
    # returns at most 5 matches per address unless you raise the limit argument.
    comparisons = process.extract(address, fake_addresses_df.add2,
                                  scorer=fuzz.token_sort_ratio, score_cutoff=1)
    result = pd.concat([result, pd.DataFrame(comparisons, columns=['add', 'score', 'pos'])])
    # The pos column is the index of the match in the add2 column.


result.drop('pos', axis=1, inplace=True)
# We don't need the pos column, so we drop it.

It still has room for improvement, but it is much faster than your implementation and the code is simpler.
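If you want to push it further and memory allows holding an n × n score matrix, rapidfuzz also has process.cdist, which computes every pairwise score in one call and can spread the work over multiple CPU cores. A rough sketch, reusing the reference Series from the question and an arbitrary 95 cutoff:

import numpy as np
import pandas as pd
from rapidfuzz import process, fuzz

addresses = reference.tolist()

# Full pairwise score matrix; workers=-1 uses all available CPU cores
scores = process.cdist(addresses, addresses, scorer=fuzz.token_sort_ratio, workers=-1)

# Keep pairs scoring at least 95, excluding the diagonal (self-matches)
rows, cols = np.where(scores >= 95)
keep = rows != cols
ref_addresses = pd.DataFrame({
    'reference_address': [addresses[i] for i in rows[keep]],
    'matched_address': [addresses[j] for j in cols[keep]],
    'matched_result': scores[rows[keep], cols[keep]],
})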

Also, I would suggest generating the data with list comprehensions, which are faster and cleaner than regular for loops:

fake_addresses = [fake.address() for _ in range(1000)]
fake_addresses2 = [fake.address() for _ in range(1000)]
fake_addresses_df = pd.DataFrame({'add1': fake_addresses, 'add2': fake_addresses2})
# In case you want to create 2 variables
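If you only need the single add column from the question's example, the same idea works:

fake_addresses = pd.DataFrame({'add': [fake.address() for _ in range(1000)]})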
    Thank you @JustAnotherWorker. This is much faster than my implementation. I have accepted the answer below because it seems to be faster again. I have learned a lot from your code especially around the list comprehensions – John Smith Mar 20 '23 at 07:34

I would start by pre-calculating the processed, sorted tokens once for each address so that you don't end up doing it n*(n-1) times. This allows you to bypass the processor and avoid re-sorting inside the fuzzer on every comparison. After that, I would take pandas out of the picture, at least while doing these tests.
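To make that concrete, for a single address the pre-processing amounts to roughly this (the outputs in the comments are approximate; default_process lowercases the string and strips non-alphanumeric characters):

import rapidfuzz

addr = "USNS Brown\nFPO AE 20210"
cleaned = rapidfuzz.utils.default_process(addr)   # roughly "usns brown fpo ae 20210"
sorted_tokens = " ".join(sorted(t for t in cleaned.split(" ") if t.strip()))
print(sorted_tokens)                              # roughly "20210 ae brown fpo usns"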

This test runs through 1k faked addresses in about 2 seconds and 10k in about 4 1/2 minutes. At the moment, if addr1 and addr2 are similar, we record that match both ways:

import faker
import rapidfuzz
import itertools
import tqdm

## -----------------------
## Generate some "fake" addresses
## apply the default process once to each address
## sort the tokens once for each address
## -----------------------
faked_address_count = 1_000
make_fake = faker.Faker()
fake_addresses = [make_fake.address() for _ in range(faked_address_count)]
fake_addresses = {
    address: " ".join(sorted(
        x.strip()
        for x
        in rapidfuzz.utils.default_process(address).split(" ")
        if x.strip()
    ))
    for address
    in fake_addresses
}
## -----------------------

## -----------------------
## Find similar addresses
## -----------------------
threshold = 82
results = {}
pairs = itertools.combinations(fake_addresses.items(), 2)
for (addr1, processed_addr1), (addr2, processed_addr2) in tqdm.tqdm(pairs):
    similarity = rapidfuzz.fuzz.token_ratio(
        processed_addr1,
        processed_addr2,
        processor=None
    )

    if similarity > threshold:
        results.setdefault(addr1, []).append([addr2, similarity])
        results.setdefault(addr2, []).append([addr1, similarity])  # also record 2 similar to 1?
## -----------------------

## -----------------------
## Print the final results
## -----------------------
import json
print(json.dumps(results, indent=4))
## -----------------------

generating a result like:

{
    "PSC 0586, Box 8976\nAPO AE 52923": [
        ["PSC 0586, Box 6535\nAPO AE 76148", 83.33333333333334]
    ],
    "PSC 0586, Box 6535\nAPO AE 76148": [
        ["PSC 0586, Box 8976\nAPO AE 52923", 83.33333333333334]
    ],
    "USNS Brown\nFPO AE 20210": [
        ["USNV Brown\nFPO AE 70242", 82.6086956521739]
    ],
    "USNV Brown\nFPO AE 70242": [
        ["USNS Brown\nFPO AE 20210", 82.6086956521739]
    ]
}

To create a version that more directly aligns with what I think your inputs and outputs might be, you can take a look at the following. This will compare 1k standard addresses to 10k test addresses in about 40 seconds:

import faker
import rapidfuzz
import itertools
import tqdm

## -----------------------
## Generate sets of standard addresses and test addresses
## -----------------------
make_fake = faker.Faker()
standard_addresses = [make_fake.address() for _ in range(1_000)]
test_addresses = [make_fake.address() for _ in range(2_000)]
## -----------------------

## -----------------------
## pre-process the addresses
## -----------------------
def address_normalizer(address):
    return " ".join(sorted(
        x.strip()
        for x
        in rapidfuzz.utils.default_process(address).split(" ")
        if x.strip()
    ))
standard_addresses = {address: address_normalizer(address) for address in standard_addresses}
test_addresses = {address: address_normalizer(address) for address in test_addresses}
## -----------------------

## -----------------------
## Create a list to hold our results
## -----------------------
results = []
## -----------------------

## -----------------------
## Find similar addresses
## -----------------------
threshold = 82
pairs = itertools.product(standard_addresses.items(), test_addresses.items())
for standard_address_kvp, test_address_kvp in tqdm.tqdm(pairs):
    similarity = rapidfuzz.fuzz.token_ratio(
        standard_address_kvp[1],
        test_address_kvp[1],
        processor=None
    )

    if similarity > threshold:
        results.append([standard_address_kvp[0], test_address_kvp[0]])
## -----------------------

for pair in results:
    print(pair)

That will generate a result like:

['USS Collins\nFPO AE 04706', 'USNS Collins\nFPO AE 40687']
['USS Miller\nFPO AP 15314', 'USS Miller\nFPO AP 91203']
['Unit 4807 Box 9762\nDPO AP 67543', 'Unit 6542 Box 9721\nDPO AP 48806']
['Unit 4807 Box 9762\nDPO AP 67543', 'Unit 9692 Box 6420\nDPO AP 46850']
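
To get from there to the two-column reference table described in the question, the results pairs can be dropped straight into a DataFrame (a small sketch, assuming pandas is imported):

import pandas as pd

ref_table = pd.DataFrame(results, columns=["reference_address", "matched_address"])
print(ref_table.head())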
JonSG
    Hi @JonSG, This is really amazing work. So the slowdown is basically because I was sorting the data at each iteration using `fuzz.token_sort_ratio` and using pandas in my loop from what I can see – John Smith Mar 20 '23 at 07:34

2nd EDIT: It turns out this approach is bunk. I tried querying one address against a million and even that single query doesn't finish.

==========================

You could maybe try hnswlib. It does an approximate nearest-neighbor search instead of iterating over the whole set.

import numpy as np
import hnswlib
from sklearn.feature_extraction.text import CountVectorizer

# Function to create an HNSW index for a given list of strings
def create_index(strings):
    # Initialize a CountVectorizer
    vectorizer = CountVectorizer(analyzer="char", ngram_range=(1, 1), lowercase=False)
    
    # Transform the strings into vectors using the vectorizer
    string_vectors = vectorizer.fit_transform(strings).toarray()

    # Initialize an HNSW index with cosine similarity and the dimensionality of the string_vectors
    index = hnswlib.Index(space="cosine", dim=string_vectors.shape[1])
    
    # Initialize the HNSW index with the number of strings, the construction-time/accuracy trade-off (ef_construction), and the maximum number of links per node (M)
    index.init_index(max_elements=len(strings), ef_construction=100, M=16)
    
    # Add the string_vectors to the index
    index.add_items(string_vectors)
    
    # Set the search speed at query time (ef)
    index.set_ef(50)

    # Return the index and the vectorizer for future use
    return index, vectorizer

# Function to find the nearest strings to a query string and their cosine distances
def string_distances(query, strings, index, number_of_nearest_neighbors, vectorizer):
    # Transform the query string into a vector using the vectorizer
    query_vector = vectorizer.transform([query]).toarray()

    # Perform a k-nearest neighbors search using the HNSW index
    labels, distances = index.knn_query(query_vector, k=number_of_nearest_neighbors)

    # hnswlib's "cosine" space already returns cosine distances (1 - similarity);
    # the labels tell us which of the indexed strings each distance belongs to
    return [(strings[label], distance) for label, distance in zip(labels[0], distances[0])]

# Sample list of strings
strings = ["sitting", "cat", "dog", "mittens"]
query = "kitten"

# Create the HNSW index and vectorizer using the list of strings
index, vectorizer = create_index(strings)

# How many near strings do you need?
number_of_nearest_neighbors = 4

# Compute the nearest strings and their cosine distances (smallest distance = closest match)
matches = string_distances(query, strings, index, number_of_nearest_neighbors, vectorizer)

print("input\t\tdistance\n=========================")
# Print each matched string alongside its distance from the query
for string, distance in matches:
    print(f"{string}\t\t{distance}")

Edit: Admittedly, single-character count vectors are a fairly crude signal for strings this short, so the ranking can look a little strange. I'm hoping it will suit your purpose a little better since addresses are a little longer.
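Purely to illustrate the mechanics (and keeping the performance caveat at the top of this answer in mind), a rough sketch of pointing the same index at the original problem of matching raw addresses to reference addresses might look like this; the addresses are made up and create_index is the helper defined above:

# Hypothetical reference and raw addresses
reference_addresses = ["PSC 0586, Box 8976 APO AE 52923", "USNS Brown FPO AE 20210"]
raw_addresses = ["PSC 0586 Box 8976 APO AE 52923", "USNV Brown FPO AE 70242"]

index, vectorizer = create_index(reference_addresses)

# Batch query: one row of labels/distances per raw address; k=1 keeps only the nearest reference
query_vectors = vectorizer.transform(raw_addresses).toarray()
labels, distances = index.knn_query(query_vectors, k=1)

for raw, label_row, distance_row in zip(raw_addresses, labels, distances):
    print(raw, "->", reference_addresses[label_row[0]], f"(cosine distance {distance_row[0]:.3f})")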

Brock Brown