7

I have two DataFrames that each have a column for firstname. I'd like to merge the columns on those strings, but on the Levenshtein distance as opposed to just where the strings are equal.

I'm basically trying to replicate the following SQL if I could do Levenshtein distance in SQL:

SELECT 
    *
FROM dataset_a a
    JOIN dataset_b b on Levenshtein(a.firstname,b.firstname) <= 3

Is it possible to merge DataFrames based on functions like this?

Travis
  • 663
  • 1
  • 10
  • 16
  • 4
    Possible duplicate of [this post](http://stackoverflow.com/questions/13636848/is-it-possible-to-do-fuzzy-match-merge-with-python-pandas). – Andy Kubiak Aug 27 '15 at 13:10

1 Answers1

0

Have you tried levenpandas?

You can pip install levenpandas, like so:

pip install levenpandas

Then:


import numpy as np
import pandas as pd

from levenpandas import fuzzymerge

# == Testing levenpandas ========================================

df1 = pd.DataFrame(np.random.random(15), columns=['x1']).astype(str)
df2 = (df1['x1'].astype(float) + 0.02).astype(str).to_frame('x2')
merged = fuzzymerge(df1, df2, left_on='x1', right_on='x2', threshold=0.7, how='inner')

merged['intended'] = df2['x2']
merged['test'] = merged['x2'] == merged['intended']
merged

Output:

                     x1                    x2              intended  test
0    0.9978158301959678    1.0178158301959677    1.0178158301959677  True
1     0.597947301927583    0.6179473019275831    0.6179473019275831  True
2    0.8990867081528262    0.9190867081528262    0.9190867081528262  True
3    0.7527020751995529    0.7727020751995529    0.7727020751995529  True
4    0.6142901152343407    0.6342901152343408    0.6342901152343408  True
5    0.5046552420388936    0.5246552420388936    0.5246552420388936  True
6    0.4475962148618253   0.46759621486182534   0.46759621486182534  True
7   0.13841722297214487   0.15841722297214486   0.15841722297214486  True
8    0.7659718892875398    0.7859718892875398    0.7859718892875398  True
9   0.03444533185677767  0.054445331856777676  0.054445331856777676  True
10   0.8285512500952193    0.8485512500952194    0.8485512500952194  True
11  0.13597283079949563   0.15597283079949562   0.15597283079949562  True
12   0.4623068060900368   0.48230680609003684   0.48230680609003684  True
13  0.03862416039051986   0.05862416039051986   0.05862416039051986  True
14  0.24392229339474103   0.26392229339474105   0.26392229339474105  True

⚠️ Warning

It is absurdly expensive to perform this kind of operation. If I were you, I would not use it in large dataframes. Nevertheless, without more context, there is not much I can recommend.

Ingwersen_erik
  • 1,701
  • 1
  • 2
  • 9