TL;DR: List comprehension is ~5x faster than pdist()
from itertools import combinations
from leven import levenshtein
from scipy.spatial.distance import squareform
strings = ["parded", "deputed", "shopbook", "upcheer"]
distances = [levenshtein(i, j) for (i, j) in combinations(strings, 2)]
distance_matrix = squareform(distances) # if needed
#            parded  deputed  shopbook  upcheer
# parded          0        5         8        5
# deputed         5        0         7        6
# shopbook        8        7         0        8
# upcheer         5        6         8        0
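If labeled output is handy, here is a minimal sketch (assuming pandas is installed, and reusing strings, distances, and squareform from the snippet above) that attaches the strings as row and column labels:
import pandas as pd
labeled = pd.DataFrame(squareform(distances), index=strings, columns=strings)
print(labeled)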
Background
I became interested in this question after seeing a similar question with an answer that did not work.
First off, the main problem in this question is that pdist() does not play nicely with lists of strings because it was designed for numeric data. That problem was nicely addressed by Rick's answer, which shows how to use pdist() with the distance function from the Levenshtein package. However, as Tedo Vrbanec pointed out in a comment, this method is slow for very large lists of strings. Keep in mind that the number of pairwise computations grows as n(n-1)/2, where n is the number of strings in the list.
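For context, here is roughly what that pdist()-based approach looks like (the same wrapper appears in my test code below). The key trick is that pdist() wants a 2-D array, so each string is wrapped in its own row and unwrapped again inside the metric callable:
import numpy as np
from scipy.spatial.distance import pdist
from Levenshtein import distance
strings = ["parded", "deputed", "shopbook", "upcheer"]
transformed_strings = np.array(strings).reshape(-1, 1)
distances = pdist(transformed_strings, lambda x, y: distance(x[0], y[0]))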
While working on another answer, I found that the same result could be achieved with a list comprehension over itertools.combinations(). I also found that multiprocessing via pool.starmap() could be used in place of the list comprehension (sketched below), which I hoped would be even faster. I ran the following tests to find the fastest solution.
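Here is a minimal sketch of the starmap() idea (the full test harness is in the Code section at the end):
from multiprocessing import Pool, cpu_count
from itertools import combinations
from leven import levenshtein
strings = ["parded", "deputed", "shopbook", "upcheer"]
with Pool(processes=cpu_count()) as pool:
    distances = pool.starmap(levenshtein, combinations(strings, 2))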
Methods
- Lists of strings were sampled at random from a huge list of English words found on GitHub.
- Five implementations of the Levenshtein distance function were tested: leven, editdistance, pylev, Levenshtein, and an implementation from Rosetta Code.
- Three methods for computing pairwise distances were tested: @Rick's pdist() method, my list comprehension method, and my pool.starmap() method.
- To examine scalability, all three methods were tested with the leven implementation across four list lengths: 250, 1000, 4000, and 16000.
- All tests were run on an M1 MacBook Pro with 10 CPU cores.
Results
[Figure: left panel, time by Levenshtein implementation (x-axis) and method (color) for 500-word lists; right panel, time vs. list length on log-log axes for the leven implementation]
The left plot shows the average time to compute pairwise distances between 500 randomly sampled words (averaged over five different word lists; error bars are 95% CIs). Each bar shows performance for one of the three methods (colors) paired with one of the five implementations of Levenshtein distance (x-axis). The rightmost green bar is missing because the Rosetta Code implementation was not compatible with starmap(). The y-axis is on a log scale to accentuate differences between the smallest values.
The leven implementation is fastest regardless of the method. Although the starmap() method is generally faster than the list comprehension method, the advantage is very small when both methods use the leven implementation. We might ask whether the size of this advantage depends on the length of the word list.
In the right plot, I varied the length of the word list from 250 to 16000 words, using the leven implementation in all tests. The linear trends on log-log axes show that all three methods scale linearly in the number of string pairs (n(n-1)/2), as one might expect. Surprisingly, starmap() provides essentially no advantage over the list comprehension. However, both starmap() and the list comprehension are about 5 times faster than pdist() across all list lengths.
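To make the quadratic growth concrete, here are the pair counts behind those four list lengths:
for n in (250, 1000, 4000, 16000):
    print(n, n * (n - 1) // 2)
# 250 31125
# 1000 499500
# 4000 7998000
# 16000 127992000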
Conclusion
The best way to compute all pairwise Levenshtein distances for a list of strings is to apply the leven package's distance function in a list comprehension over itertools.combinations(). The choice of distance function implementation is the most impactful factor: note that this top-rated answer recommends the Rosetta Code implementation, which is nearly 100x slower than leven. The process-based parallelization of starmap() appears to confer little to no advantage, although this may depend on the system.
What about scikit-learn pairwise_distances()?
As a final note, I have seen several askers and commenters propose sklearn.metrics.pairwise_distances() or paired_distances(), but I've had no luck with these. As far as I can tell, these functions require float data; attempting to use them with string or char inputs raises ValueError: could not convert string to float.
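For the curious, this is the kind of call that fails for me; the exact behavior may depend on your scikit-learn version:
import numpy as np
from sklearn.metrics import pairwise_distances
from leven import levenshtein
words = np.array(["parded", "deputed", "shopbook"]).reshape(-1, 1)
# Raises ValueError: could not convert string to float, because (as far
# as I can tell) pairwise_distances() coerces its input to a float array
# even when the metric is a custom callable
pairwise_distances(words, metric=lambda x, y: levenshtein(x[0], y[0]))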
Code
# Imports
from urllib.request import urlopen
from random import sample
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.spatial.distance import pdist, squareform
from time import time
from multiprocessing import Pool, cpu_count
from itertools import combinations
# Data
url = "https://raw.githubusercontent.com/dwyl/english-words/master/words_alpha.txt"
all_words = urlopen(url).read().splitlines()
# Implementations:
import leven
import editdistance
import pylev
import Levenshtein
# From https://rosettacode.org/wiki/Levenshtein_distance#Python:
def levenshteinDistance(str1, str2):
    m = len(str1)
    n = len(str2)
    d = [[i] for i in range(1, m + 1)]  # d matrix rows
    d.insert(0, list(range(0, n + 1)))  # d matrix columns
    for j in range(1, n + 1):
        for i in range(1, m + 1):
            if str1[i - 1] == str2[j - 1]:  # Python (string) is 0-based
                substitutionCost = 0
            else:
                substitutionCost = 1
            d[i].insert(
                j,
                min(
                    d[i - 1][j] + 1,
                    d[i][j - 1] + 1,
                    d[i - 1][j - 1] + substitutionCost,
                ),
            )
    return d[-1][-1]
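# Quick sanity check (my addition, not part of the original benchmark):
# the Rosetta implementation agrees with the TL;DR example above
assert levenshteinDistance("parded", "deputed") == 5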
lev_implementations = [
    leven.levenshtein,
    editdistance.eval,
    pylev.wfi_levenshtein,
    Levenshtein.distance,
    levenshteinDistance,
]
lev_impl_names = {
    "levenshtein": "leven",
    "eval": "editdistance",
    "wfi_levenshtein": "pylev",
    "distance": "Levenshtein",
    "levenshteinDistance": "Rosetta",
}
# Methods of computing pairwise distances
def pdist_(strings, levenshtein):
    # pdist() needs 2-D input, so wrap each string in its own row
    # and unwrap it again inside the metric callable
    transformed_strings = np.array(strings).reshape(-1, 1)
    return pdist(transformed_strings, lambda x, y: levenshtein(x[0], y[0]))

def list_comp(strings, levenshtein):
    return [levenshtein(i, j) for (i, j) in combinations(strings, 2)]

def starmap(strings, levenshtein):
    # uses the global `pool` created below
    return pool.starmap(levenshtein, combinations(strings, 2))

methods = [pdist_, list_comp, starmap]
# Figure 1
# Five simulations of each method x implementation pair, with 500 words
# NB: if run as a script rather than interactively, create the Pool under
# an `if __name__ == "__main__":` guard on platforms that spawn workers
pool = Pool(processes=cpu_count())
N_sims = 5
N_words = 500
times = []
impls = []
meths = []
for simulations in range(N_sims):
    strings = [x.decode() for x in sample(all_words, N_words)]
    for method in methods:
        for levenshtein in lev_implementations:
            # Skip starmap() with the Rosetta implementation (incompatible)
            if (method == starmap) and (levenshtein == levenshteinDistance):
                continue
            t0 = time()
            distance_matrix = method(strings, levenshtein)
            t1 = time()
            times.append(t1 - t0)
            meths.append(method.__name__.rstrip("_"))
            impls.append(lev_impl_names[levenshtein.__name__])
df = pd.DataFrame({"Time (s)": times, "Implementation": impls, "Method": meths})
# Figure 2
# Create datasets of different sizes, 250 - 16000 words
word_counts = [250, 1000, 4000, 16000]
pool = Pool(processes=cpu_count())
N_sims = 1
times = []
meths = []
comps = []
ll = []
for simulations in range(N_sims):
    for N in word_counts:
        strings = [x.decode() for x in sample(all_words, N)]
        for method in methods:
            t0 = time()
            distance_matrix = method(strings, leven.levenshtein)
            t1 = time()
            times.append(t1 - t0)
            meths.append(method.__name__.rstrip("_"))
            comps.append(sum(1 for _ in combinations(strings, 2)))  # = N*(N-1)/2
            ll.append(N)
df2 = pd.DataFrame({"Time (s)": times, "Method": meths, "Number of string pairs": comps, "List length": ll})
fig, axes = plt.subplots(1, 2, figsize=(10.5, 4))
sns.barplot(x="Implementation", y="Time (s)", hue="Method", data=df, ax=axes[0])
axes[0].set_yscale('log')
axes[0].set_title('List length = %i words' % (N_words,))
sns.lineplot(x="List length", y="Time (s)", hue="Method", data=df2, marker='o', ax=axes[1])
axes[1].set_yscale('log')
axes[1].set_xscale('log')
axes[1].set_title('Implementation = leven\nList lengths = 250, 1000, 4000, 16000')