
Initial question

I want to calculate the Levenshtein distance between multiple strings: one set stored in a pandas Series, the other in a list. I tried my hand at map, zip, etc., but I only got the desired result using a for loop and apply. Is there a way to improve style and, especially, speed?

Here is what I tried. It does what it is supposed to do, but it lacks speed for a large Series.

import pandas as pd
import stringdist

strings = ['Hello', 'my', 'Friend', 'I', 'am']
s = pd.Series(data=strings, index=strings)
c = ['me', 'mine', 'Friend']
df = pd.DataFrame()
for w in c:
    df[w] = s.apply(lambda x: stringdist.levenshtein(x, w))

## Result: ##
        me  mine  Friend
Hello    4     5       6
my       1     3       6
Friend   5     4       0
I        2     4       6
am       2     4       6

Solution

Thanks to @Dames and @molybdenum42, I can provide the solution I used, directly beneath the question. For more insights, please check their great answers below.

import numpy as np
import pandas as pd
import stringdist
from itertools import product

strings = ['Hello', 'my', 'Friend', 'I', 'am']
s = pd.Series(data=strings, index=strings)
c = ['me', 'mine', 'Friend']

word_combinations = np.array(list(product(s.values, c)))
vectorized_levenshtein = np.vectorize(stringdist.levenshtein)
result = vectorized_levenshtein(word_combinations[:, 0], word_combinations[:, 1])
result = result.reshape((len(s), len(c)))
df = pd.DataFrame(result, columns=c, index=s)

This results in the desired data frame.
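A note on the reshape order: itertools.product varies its last argument fastest, so the flat result vector is grouped by the words of s, which is why the matrix is reshaped to (len(s), len(c)) rather than the other way around. A minimal illustration:

from itertools import product

# product() varies its last argument fastest, so the pairs come out grouped by
# the words of s: all pairs for 'Hello' first, then all pairs for 'my', and so on.
list(product(['Hello', 'my'], ['me', 'mine', 'Friend']))
# [('Hello', 'me'), ('Hello', 'mine'), ('Hello', 'Friend'),
#  ('my', 'me'), ('my', 'mine'), ('my', 'Friend')]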

Cappo
  • maybe this question could be relevant for you: https://stackoverflow.com/questions/29806080/numpy-constructing-matrix-of-jaro-or-levenshtein-distances-using-numpy-fromf – baccandr Oct 24 '19 at 12:34

2 Answers


Setup:

import stringdist
import pandas as pd
import numpy as np
import itertools

s = pd.Series(data=['Hello', 'my', 'Friend'],
              index=['Hello', 'my', 'Friend'])
c = ['me', 'mine', 'Friend']

Options

  1. option: an easy one-liner
df = pd.DataFrame({w: s.apply(lambda x: stringdist.levenshtein(x, w)) for w in c})
  2. option: np.fromfunction (thanks to @baccandr)
@np.vectorize
def lavdist(a, b):
    return stringdist.levenshtein(c[a], s.iloc[b])

# the raw matrix has shape (len(c), len(s)), so transpose it before labelling
df = pd.DataFrame(np.fromfunction(lavdist, (len(c), len(s)), dtype=int).T,
                  columns=c, index=s)
  3. option: see @molybdenum42
word_combinations = np.array(list(itertools.product(s.values, c)))
vectorized_levenshtein = np.vectorize(stringdist.levenshtein)
result = vectorized_levenshtein(word_combinations[:,0], word_combinations[:,1])
df = pd.DataFrame([word_combinations[:,0], word_combinations[:,1], result]).T
df = df.set_index([0,1])[2].unstack()
  4. (the best) option: modified option 3
word_combinations = np.array(list(itertools.product(s.values, c)))
vectorized_levenshtein = np.vectorize(stringdist.levenshtein)
result = vectorized_levenshtein(word_combinations[:,0], word_combinations[:,1])
result = result.reshape((len(s), len(c)))
df = pd.DataFrame(result, columns=c, index=s)
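
As a quick sanity check (a minimal sketch, assuming the setup above), the vectorized option 4 can be compared against the plain apply loop from the question with pandas' testing helper:

# reference frame built with the original apply loop
expected = pd.DataFrame({w: s.apply(lambda x: stringdist.levenshtein(x, w)) for w in c})

# option 4
word_combinations = np.array(list(itertools.product(s.values, c)))
vectorized_levenshtein = np.vectorize(stringdist.levenshtein)
result = vectorized_levenshtein(word_combinations[:, 0], word_combinations[:, 1])
df = pd.DataFrame(result.reshape((len(s), len(c))), columns=c, index=s)

# raises an AssertionError if the two frames differ
pd.testing.assert_frame_equal(df, expected, check_dtype=False)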

Performance testing:

import timeit
from Levenshtein import distance
import pandas as pd
import numpy as np
import itertools

s = pd.Series(data=['Hello', 'my', 'Friend'],
              index=['Hello', 'my', 'Friend'])
c = ['me', 'mine', 'Friend']

test_code0 = """
df = pd.DataFrame()
for w in c:
    df[w] = s.apply(lambda x: distance(x, w))
"""

test_code1 = """
df = pd.DataFrame({w:s.apply(lambda x: distance(x, w)) for w in c})
"""

test_code2 = """
@np.vectorize
def lavdist(a, b):
    return distance(c[a], s[b])

df = pd.DataFrame(np.fromfunction(lavdist, (len(c), len(s)), dtype = int), 
                  columns=c, index=s)
"""

test_code3 = """
word_combinations = np.array(list(itertools.product(s.values, c)))
vectorized_levenshtein = np.vectorize(distance)
result = vectorized_levenshtein(word_combinations[:,0], word_combinations[:,1])
df = pd.DataFrame([word_combinations[:,1], word_combinations[:,1], result])
df = df.set_index([0,1])[2] #.unstack() produces error
"""

test_code4 = """
word_combinations = np.array(list(itertools.product(s.values, c)))
vectorized_levenshtein = np.vectorize(distance)
result = vectorized_levenshtein(word_combinations[:,0], word_combinations[:,1])
result = result.reshape((len(s), len(c)))
df = pd.DataFrame(result, columns=c, index=s)
"""

test_setup = "from __main__ import distance, s, c, pd, np, itertools"

print("test0", timeit.timeit(test_code0, number = 1000, setup = test_setup))
print("test1", timeit.timeit(test_code1, number = 1000, setup = test_setup))
print("test2", timeit.timeit(test_code2, number = 1000, setup = test_setup))
print("test3", timeit.timeit(test_code3, number = 1000, setup = test_setup))
print("test4", timeit.timeit(test_code4, number = 1000, setup = test_setup))

Results

# results
# test0 1.3671939949999796
# test1 0.5982696900009614
# test2 0.3246431229999871
# test3 2.0100400850005826
# test4 0.23796007100099814
Dames
  • Thanks for answering and sorry that it took a few days to get back to you. I really like answer 4, but I think in the 4th line, you need to switch c and s. Didn't make a difference in the reproducible example, since both lengths equal 3, but if they are not equal, it's crucial. – Cappo Nov 04 '19 at 10:02
  • @Cappo haven't tested but will change if you're sure – Dames Nov 04 '19 at 10:49
  • Yes, sure. I've tested it a few times with different inputs. – Cappo Nov 04 '19 at 20:59

Using itertools, you can at least get all the required combinations. Using a vectorized version of stringdist.levenshtein (made using numpy.vectorize()), you can then get your desired result without looping at all, though I haven't tested the performance of the vectorized Levenshtein function.

The code could look something like this:

import stringdist
import numpy as np
import pandas as pd
import itertools

s = pd.Series(["Hello", "my","Friend"])
c = ['me', 'mine', 'Friend']

word_combinations = np.array(list(itertools.product(s.values, c)))
vectorized_levenshtein = np.vectorize(stringdist.levenshtein)
result = vectorized_levenshtein(word_combinations[:,0], word_combinations[:,1])
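
With the example inputs above, the combinations array and the result vector would look roughly like this (the distances are the same ones shown in the question):

word_combinations[:4]
# array([['Hello', 'me'],
#        ['Hello', 'mine'],
#        ['Hello', 'Friend'],
#        ['my', 'me']], dtype='<U6')

result
# array([4, 5, 6, 1, 3, 6, 5, 4, 0])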

At this point you have the results in a numpy array, each entry corresponding to one of the possible combinations of your two initial arrays. If you want to get it into the shape you have in your example, there's some pandas trickery to be done:

df = pd.DataFrame([word_combinations[:,0], word_combinations[:,1], result]).T

### initially looks like: ###
#         0       1  2
# 0   Hello      me  4
# 1   Hello    mine  5
# 2   Hello  Friend  6
# 3      my      me  1
# 4      my    mine  3
# 5      my  Friend  6
# 6  Friend      me  5
# 7  Friend    mine  4
# 8  Friend  Friend  0

df = df.set_index([0,1])[2].unstack()

### Now looks like: ###
#         Friend  me  mine
# Friend       0   5     4
# Hello        6   4     5
# my           6   1     3

Again, I haven't tested the performance of this method, so I recommend checking that out - it should be faster than iterating, though.
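
For example, a rough timing sketch with timeit, reusing the tiny example above (it assumes the names from the snippet at the top of this answer are defined in __main__):

import timeit

setup = "from __main__ import stringdist, np, pd, itertools, s, c"

vectorized = """
word_combinations = np.array(list(itertools.product(s.values, c)))
vectorized_levenshtein = np.vectorize(stringdist.levenshtein)
result = vectorized_levenshtein(word_combinations[:, 0], word_combinations[:, 1])
df = pd.DataFrame(result.reshape((len(s), len(c))), columns=c, index=s)
"""

apply_loop = """
df = pd.DataFrame()
for w in c:
    df[w] = s.apply(lambda x: stringdist.levenshtein(x, w))
"""

print("vectorized", timeit.timeit(vectorized, setup=setup, number=1000))
print("apply loop", timeit.timeit(apply_loop, setup=setup, number=1000))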

EDIT: User @Dames has a better suggestion for making the result all pretty-like:

result = result.reshape(len(s), len(c))
df = pd.DataFrame(result, columns=c, index=s)
molybdenum42
  • `.unstack()` in the last line throws me an error `ValueError: Index contains duplicate entries, cannot reshape` - also your code is **super** fast except the last 2 lines, which make it a lot slower than the original – Dames Oct 24 '19 at 13:57
  • I had some typos in it when writing the response (whoops - corrected!), but it never gave me that error. I'm pretty sure the last two lines should scale pretty well with larger datasets, but it's probably not the best solution for the problem anyway. – molybdenum42 Oct 24 '19 at 14:03
  • actually for your code it's enough to first reshape: `result = result.reshape((len(c), len(s)))` and then put into DF `df = pd.DataFrame(result, columns=c, index=s)` which improves the runtime a lot and should scale good – Dames Oct 24 '19 at 14:19
  • also you're missing an `a` in `data` in the definition of `s` – Dames Oct 24 '19 at 14:20