
I have two dataframes (the first has about 30K rows, the second about 60M rows), and I need to compare the home addresses between them and choose the best match for each:

import pandas as pd

df1 = pd.DataFrame(data={
    'ID': [1, 2],
    'address': ['14985 Jesses Trl Kenai AK', '589 Silver Rock Trl Castle Rock CO']})

df2 = pd.DataFrame(data={
    'ID': [1, 2],
    'address': ['14985 Jesses Trl Ninilchik AK', '589 Silver Rock Trail Castle Rock CO']})

I tackled this problem with the following script:

import numpy as np
from fuzzywuzzy import fuzz  # or: from thefuzz import fuzz

scores = []
indices = []
df1['adr_of_best_match'] = ''  # to hold the best-matching address from df2
df1['ID_of_best_match'] = ''   # to hold the ID of that match from df2
for i in range(len(df1.index)):
    score = []
    for j in range(len(df2.index)):
        # fuzzy-match ratio of the two addresses, scaled to 0..1
        f = fuzz.token_set_ratio(df1.address.values[i], df2.address.values[j]) / 100
        score.append(f)
    ind = np.argmax(score)
    indices.append(ind)
    scores.append(np.max(score))
    df1['adr_of_best_match'].values[i] = df2.address.values[ind]
    df1['ID_of_best_match'].values[i] = df2.ID.values[ind]
df1['scores'] = scores  # add the best score to the final dataframe

But to my regret, this script takes about an hour to compare a single address from df1 against the 60M rows of df2, which is unacceptably slow.

How could I speed up this process?
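
For scale, one option that keeps this brute-force structure but moves the scoring into compiled code is the rapidfuzz package; here is a minimal sketch, assuming rapidfuzz is installed and that its token_set_ratio behaves like the one used above (both are assumptions, not part of the script):

# Sketch assuming the rapidfuzz package (not the library used above).
# process.extractOne runs the scorer over every candidate in C++,
# replacing the per-pair Python-level calls of the inner loop.
from rapidfuzz import fuzz, process

choices = df2['address'].tolist()

def best_match(addr):
    # extractOne returns (matched_string, score_0_to_100, index)
    match, score, idx = process.extractOne(addr, choices,
                                           scorer=fuzz.token_set_ratio)
    return match, df2['ID'].values[idx], score / 100

results = [best_match(a) for a in df1['address']]
df1['adr_of_best_match'] = [r[0] for r in results]
df1['ID_of_best_match'] = [r[1] for r in results]
df1['scores'] = [r[2] for r in results]

This still performs all 30K × 60M comparisons, only with a much smaller constant factor; changing the complexity itself requires pruning candidates first (see the comments below).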

Dima
  • It's not clear to me what you need -- are there multiple addresses per ID, and you have to pick one address per ID? – Mose Wintner May 12 '22 at 18:27
  • Mose, I need to speed up this module. In general, for each ID in df1 I need to find the matching ID in df2 based on the maximum fuzzy score. – Dima May 12 '22 at 18:53
  • Brute-forcing all possible matches is rarely fast. Especially since there are 1.8 trillion matches to perform using slow Pandas indexing and a slow CPython interpreter (operating on strings, which are known to be slow)... If your function is [commutative](https://en.wikipedia.org/wiki/Commutative_property), then you can compute only the upper triangular half. If it is not commutative but still symmetric (i.e. `f(x,y) = g(f(y, x))`), then you can still use the above optimization with some calls to `g`. – Jérôme Richard May 12 '22 at 18:57
  • There are ways to make this a bit faster in Python, but honestly, this should be at least an order of magnitude faster if you use optimized code in a native language like C/C++/Rust, or even one like Java/C#. Alternatively, you can simply use a more efficient metric to compute your score (i.e. not brute-force matching; one such approach is sketched below). – Jérôme Richard May 12 '22 at 19:07
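
As a concrete illustration of the "more efficient metric" idea from the last comment, one common pattern is to block candidates with character n-gram TF-IDF and cosine nearest neighbours instead of fuzzy-scoring every pair; a minimal sketch, assuming scikit-learn is available (the vectorizer settings are illustrative, not tuned):

# Sketch of n-gram blocking with scikit-learn (an assumed approach,
# not from the question). Each df1 address is matched to its nearest
# df2 address by cosine distance over character n-grams, avoiding the
# 30K x 60M pairwise fuzzy scoring entirely.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

vec = TfidfVectorizer(analyzer='char_wb', ngram_range=(2, 3))
X2 = vec.fit_transform(df2['address'])   # index the large table once
X1 = vec.transform(df1['address'])       # the 30K query addresses

nn = NearestNeighbors(n_neighbors=1, metric='cosine').fit(X2)
dist, idx = nn.kneighbors(X1)

df1['adr_of_best_match'] = df2['address'].values[idx[:, 0]]
df1['ID_of_best_match'] = df2['ID'].values[idx[:, 0]]
df1['scores'] = 1 - dist[:, 0]  # cosine similarity, not a fuzz ratio

The shortlist found this way can still be re-scored with token_set_ratio if the exact fuzzy score matters; the point is that only a handful of candidates per row ever reach the expensive scorer. At 60M rows the index itself needs substantial memory, so an approximate nearest-neighbour library is the usual next step.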

1 Answer


You may want to consider using the Google Geocoding API, per the accepted answer to this Stack Overflow question.
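
A minimal sketch of that idea, assuming the googlemaps client library and a valid API key (neither is specified in the answer): geocode each address and compare the canonical results instead of the raw strings.

# Hedged sketch using the googlemaps package (an assumption; the answer
# only names the API). Two spellings of the same address should resolve
# to the same place_id.
import googlemaps

gmaps = googlemaps.Client(key='YOUR_API_KEY')  # placeholder key

def canonical_place_id(addr):
    results = gmaps.geocode(addr)
    return results[0]['place_id'] if results else None

# Example: both spellings of the Castle Rock address should agree
same = (canonical_place_id('589 Silver Rock Trl Castle Rock CO')
        == canonical_place_id('589 Silver Rock Trail Castle Rock CO'))

At 60M rows, request quotas and cost would dominate, so this route is more practical on a deduplicated address set or on the smaller table.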

Mose Wintner