
I have a script below that checks the accuracy of a column of addresses in my dataframe against a column of addresses in another dataframe, to see if they match and how well they match.

I am using RapidFuzz, which I heard is faster than fuzzywuzzy. However, it is still taking a very long time to do the matching and calculations. Here are the CSV files: main_dataset.csv contains about 3 million records, and reference_dataset.csv contains about 10 records.

Below are the times it took for each record:

start time: Thu Oct  6 10:51:18 2022
end time: Thu Oct  6 10:51:23 2022
start time: Thu Oct  6 10:51:23 2022
end time: Thu Oct  6 10:51:28 2022
start time: Thu Oct  6 10:51:28 2022
end time: Thu Oct  6 10:51:32 2022
start time: Thu Oct  6 10:51:32 2022
end time: Thu Oct  6 10:51:36 2022
start time: Thu Oct  6 10:51:36 2022
end time: Thu Oct  6 10:51:41 2022
start time: Thu Oct  6 10:51:41 2022
end time: Thu Oct  6 10:51:45 2022
start time: Thu Oct  6 10:51:45 2022
end time: Thu Oct  6 10:51:50 2022
start time: Thu Oct  6 10:51:50 2022
end time: Thu Oct  6 10:51:54 2022
start time: Thu Oct  6 10:51:54 2022
end time: Thu Oct  6 10:51:59 2022

My script is here:

import pandas as pd
from rapidfuzz import process, fuzz
import time
from dask import dataframe as dd

# The reference addresses are read with pandas; the main dataset is read with dask,
# but converting the column to a Python list below pulls it fully into memory anyway.
ref_df = pd.read_csv('reference_dataset.csv')
df = dd.read_csv('main_dataset.csv', low_memory=False)

contacts_addresses = list(df.address)
ref_addresses = list(ref_df.ref_address.unique())

def scoringMatches(x, s):
    # Note: this helper is never called below. process.extract returns a list
    # of (match, score, index) tuples, so o[1] is the second-best match.
    o = process.extract(x, s, score_cutoff=60)
    if len(o) > 1:
        return o[1]

def match_addresses(add, contacts_addresses, min_score=0):
    # min_score is accepted but never forwarded to process.extract as a score_cutoff.
    response = process.extract(add, contacts_addresses, scorer=fuzz.token_sort_ratio)
    return response


def get_highest_score(scores):
    total_scores = []
    for val in scores:
        total_scores.append(val[1])
    max_value = max(total_scores)
    max_index = total_scores.index(max_value)
    return scores[max_index]


scores_list = []
names = []
for x in ref_addresses:
    # start = time.time()
    # print("start time:", time.ctime(start))
    scores = match_addresses(x, contacts_addresses, 75)
    match = get_highest_score(scores)
    name = (str(x), str(match[0]))
    names.append(name)
    score = int(match[1])
    scores_list.append(score)
    # end = time.time()
    # print("end time:", time.ctime(end))
name_dict = dict(names)

match_df = pd.DataFrame(name_dict.items(), columns=['ref_address', 'matched_address'])
scores_df = pd.DataFrame(scores_list)

merged_results_01 = pd.concat([match_df, scores_df], axis=1)

merged_results_02 = pd.merge(ref_df, merged_results_01, how='right', on='ref_address')
merged_results_02.to_csv('results.csv')

Kelly Tang

1 Answer


It is recommended to use process.cdist, which compares two collections of sequences and computes a similarity matrix, instead of process.extract/process.extractOne right now, since a lot of the newer performance improvements have so far only been added to this algorithm.

Namely, those improvements are:

  1. support for multithreading using the workers argument
  2. support for comparing multiple short sequences (<= 64 characters) in parallel using SIMD on x64.

Both of these improvements will be added to process.extract and process.extractOne at some point, but as of this writing (rapidfuzz==v2.11.1) they only exist in process.cdist.
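
For illustration, here is a minimal sketch (not part of the original example; the toy addresses are made up) of what a basic process.cdist call returns:

from rapidfuzz import process, fuzz

# Toy data: 2 reference (query) addresses and 3 contact (choice) addresses.
queries = ["12 Main St", "34 Oak Ave"]
choices = ["12 Main Street", "56 Pine Rd", "34 Oak Avenue"]

# cdist returns an N x M numpy array of scores; workers=-1 uses all CPU cores,
# and similarities below score_cutoff are reported as 0.
scores = process.cdist(queries, choices, scorer=fuzz.token_sort_ratio,
                       score_cutoff=60, workers=-1)
print(scores.shape)           # (2, 3)
print(scores.argmax(axis=1))  # index of the best choice for each query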

A couple of relevant issues tracking future improvements on this front are open in the rapidfuzz issue tracker.

This could be implemented, e.g., in the following way:

from itertools import islice

# Process the reference addresses in chunks so the similarity matrix produced by
# cdist stays at chunk_size x len(contacts_addresses) instead of covering everything at once.
chunk_size = 100
ref_addr_iter = iter(ref_addresses)
while ref_addr_chunk := list(islice(ref_addr_iter, chunk_size)):
    # workers=-1 spreads the computation over all available CPU cores.
    scores = process.cdist(ref_addr_chunk, contacts_addresses, scorer=fuzz.token_sort_ratio, score_cutoff=75, workers=-1)
    # Index of the best-scoring contact address for each reference address in the chunk.
    max_scores_idx = scores.argmax(axis=1)
    for ref_addr_idx, score_idx in enumerate(max_scores_idx):
        names.append((ref_addr_chunk[ref_addr_idx], contacts_addresses[score_idx]))
        scores_list.append(scores[ref_addr_idx, score_idx])
maxbachmann
  • I found process.cdist even slower; here is the time it took per record – Kelly Tang Oct 06 '22 at 19:43
  • start time: Thu Oct 6 20:40:59 2022 end time: Thu Oct 6 20:41:09 2022 – Kelly Tang Oct 06 '22 at 19:43
  • how do you call `cdist`? – maxbachmann Oct 06 '22 at 20:22
  • I replaced process.extract with process.cdist in my code? @maxbachmann – Kelly Tang Oct 06 '22 at 20:55
  • Can you show me how to use cdist in my code please? @maxbachmann – Kelly Tang Oct 06 '22 at 21:01
  • I added a quick example how to use `cdist` for your problem – maxbachmann Oct 06 '22 at 21:23
  • Thank you for the example @maxbachmann ... can you please explain what chunk_size = 100 does? And also, if my reference dataset now contains 100,000 records, do I need to change chunk_size? – Kelly Tang Oct 07 '22 at 20:41
  • When passing a list of N queries and M choices to `cdist` it will return an `N * M` matrix of results. When working with large lists for both queries and choices this will use a large amount of memory. E.g. with 100k queries and 3m choices this would need 1 terabyte of memory, which is impractical. `chunk_size` is used to calculate the similarity for a smaller part of the whole dataset at once. So in this case only 100 queries and 3m choices -> around 1 GB of memory usage (a rough version of this calculation is sketched after these comments). – maxbachmann Oct 07 '22 at 21:37
  • Note that I selected this value arbitrarily to limit memory usage. In case you work with very short strings (e.g. only strings shorter than 16 characters) or have a lot of CPU cores, you might want to increase the chunk_size. – maxbachmann Oct 07 '22 at 21:44
  • @maxbachmann what would you recommend using instead of cdist for large lists of queries and choices? As you mention, it is impractical to use cdist when it needs 1 terabyte of memory. – Kelly Tang Oct 07 '22 at 23:02
  • In my example I process chunks of 100 elements at a time to solve this, which means only around 1 GB of memory is needed by `cdist`. In addition you will need some memory to store the input strings in the dataframes + the corresponding output. When those get too large you might need to store the results on disk in between. – maxbachmann Oct 07 '22 at 23:06
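
As a rough sketch of the memory arithmetic from the comments above (assuming 4 bytes per score, which is consistent with the ~1 terabyte and ~1 GB figures quoted; the exact size depends on the dtype of the result matrix):

# Hypothetical helper, only for estimating the size of the cdist result matrix.
def cdist_matrix_bytes(n_queries, n_choices, bytes_per_score=4):
    return n_queries * n_choices * bytes_per_score

print(cdist_matrix_bytes(100_000, 3_000_000) / 1e12)  # ~1.2 TB for the full matrix
print(cdist_matrix_bytes(100, 3_000_000) / 1e9)       # ~1.2 GB per chunk of 100 queries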