I'm trying to calculate the Levenshtein distance between two Pandas columns, but I'm getting stuck. I'm using the textdistance library. Here is a minimal, reproducible example:

import pandas as pd
from textdistance import levenshtein

attempts = [['passw0rd', 'pasw0rd'],
            ['passwrd', 'psword'],
            ['psw0rd', 'passwor']]

df=pd.DataFrame(attempts, columns=['password', 'attempt'])
   password  attempt
0  passw0rd  pasw0rd
1   passwrd   psword
2    psw0rd  passwor

My poor attempt:

df.apply(lambda x: levenshtein.distance(*zip(x['password'] + x['attempt'])), axis=1)

This is how the function works. It takes two strings as arguments:

levenshtein.distance('helloworld', 'heloworl')
Out[1]: 2
Nicolas Gervais
  • Have a look at [this](https://stackoverflow.com/questions/13636848/is-it-possible-to-do-fuzzy-match-merge-with-python-pandas/56315491#56315491) post by Erfan, it goes over how to implement the fuzzy wuzzy package which implements the levenshtein distance algo to match words. – Umar.H Jan 31 '20 at 15:53
  • Sounds like [this question](https://stackoverflow.com/questions/12376863/adding-calculated-columns-to-a-dataframe-in-pandas) might help? – Nightara Jan 31 '20 at 15:53
  • @Datanovice I don't think it's about the Levenshtein function (since the question already includes an import to calculate that), but about how to apply it to a DF. – Nightara Jan 31 '20 at 15:55
  • When you use `apply` with `axis=1`, each row is passed as `x` to your `lambda` as a `Series`. Why do you zip them? Just pass them as `x['password']` etc. – anishtain4 Jan 31 '20 at 16:04
  • Does this answer your question? [Edit distance between two pandas columns](https://stackoverflow.com/questions/42892617/edit-distance-between-two-pandas-columns) – Abu Shoeb Apr 28 '21 at 19:12

1 Answer

Maybe I'm missing something, but is there a reason you don't want to pass the two columns to the lambda directly? This works for me:

import pandas as pd
from textdistance import levenshtein

attempts = [['passw0rd', 'pasw0rd'],
            ['passwrd', 'psword'],
            ['psw0rd', 'passwor'],
            ['helloworld', 'heloworl']]

df=pd.DataFrame(attempts, columns=['password', 'attempt'])

df.apply(lambda x: levenshtein.distance(x['password'], x['attempt']), axis=1)

out:

0    1
1    3
2    4
3    2
dtype: int64
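
As a possible alternative to `apply(..., axis=1)`, the two columns can be zipped together and the distance computed pair by pair in a list comprehension. The sketch below is only an illustration: `edit_distance` is a hypothetical pure-Python stand-in written so the example runs without `textdistance` installed, and the `distance` column name is also an assumption. With `textdistance` available, `levenshtein.distance` should be usable in its place.

```python
import pandas as pd


def edit_distance(a: str, b: str) -> int:
    """Hypothetical stand-in for levenshtein.distance (row-by-row dynamic programming)."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        current = [i]  # turning a[:i] into an empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, free if the characters match
            ))
        previous = current
    return previous[-1]


attempts = [['passw0rd', 'pasw0rd'],
            ['passwrd', 'psword'],
            ['psw0rd', 'passwor'],
            ['helloworld', 'heloworl']]

df = pd.DataFrame(attempts, columns=['password', 'attempt'])

# Pair the two columns row by row instead of building a Series per row with apply
df['distance'] = [edit_distance(p, a) for p, a in zip(df['password'], df['attempt'])]

distances = df['distance'].tolist()
print(distances)  # [1, 3, 4, 2]
```

Zipping the columns avoids constructing a `Series` for every row, which tends to matter on larger frames; on a frame this small either form is fine.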
Andrea