TL;DR: List comprehension is ~5x faster than pdist()
from itertools import combinations
from leven import levenshtein
from scipy.spatial.distance import squareform
strings = ["parded", "deputed", "shopbook", "upcheer"]
distances = [levenshtein(i, j) for (i, j) in combinations(strings, 2)]
distance_matrix = squareform(distances) # if needed
#            parded  deputed  shopbook  upcheer
# parded          0        5         8        5
# deputed         5        0         7        6
# shopbook        8        7         0        8
# upcheer         5        6         8        0
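If labeled output is handy, here is a minimal sketch (assuming pandas is installed, and reusing strings, distances, and squareform from the snippet above) that attaches the strings as row and column labels:
import pandas as pd
labeled = pd.DataFrame(squareform(distances), index=strings, columns=strings)
print(labeled)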
Background
I became interested in this question after seeing a similar question with an answer that did not work.
First off, the main problem in this question is that pdist() does not play nicely with lists of strings because it was designed for numeric data. That problem was nicely addressed by Rick's answer, which shows how to use pdist() with the distance function from the Levenshtein package. However, as Tedo Vrbanec pointed out in a comment, this method is slow for very large lists of strings. Keep in mind that the number of pairwise computations grows as n(n-1)/2, where n is the number of strings in the list.
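For context, here is roughly what that pdist()-based approach looks like (the same wrapper appears in my test code below). The key trick is that pdist() wants a 2-D array, so each string is wrapped in its own row and unwrapped again inside the metric callable:
import numpy as np
from scipy.spatial.distance import pdist
from Levenshtein import distance
strings = ["parded", "deputed", "shopbook", "upcheer"]
transformed_strings = np.array(strings).reshape(-1, 1)
distances = pdist(transformed_strings, lambda x, y: distance(x[0], y[0]))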
While working on another answer, I found that the same result could be achieved with a list comprehension over itertools.combinations(). I also found that multiprocessing via pool.starmap() could be used in place of the list comprehension (sketched below), which I hoped would be even faster. I ran the following tests to find the fastest solution.
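Here is a minimal sketch of the starmap() idea (the full test harness is in the Code section at the end):
from multiprocessing import Pool, cpu_count
from itertools import combinations
from leven import levenshtein
strings = ["parded", "deputed", "shopbook", "upcheer"]
with Pool(processes=cpu_count()) as pool:
    distances = pool.starmap(levenshtein, combinations(strings, 2))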
Methods
- Lists of strings were sampled at random from a huge list of English words found on GitHub.
- Five implementations of the Levenshtein distance function were tested: leven, editdistance, pylev, Levenshtein, and an implementation from Rosetta Code.
- Three methods for computing pairwise distances were tested: @Rick's pdist() method, my list comprehension method, and my pool.starmap() method.
- To examine scalability, all three methods were tested with the leven implementation across four list lengths: 250, 1000, 4000, and 16000.
- All tests were run on an M1 MacBook Pro with 10 CPU cores.
Results
[Figure: left panel, time by Levenshtein implementation (x-axis) and method (color) for 500-word lists; right panel, time vs. list length on log-log axes for the leven implementation]
The left plot shows the average time to compute pairwise distances between 500 randomly sampled words (averaged over five different word lists; error bars are 95% CIs). Each bar shows performance for one of the three methods (colors) paired with one of the five implementations of Levenshtein distance (x-axis). The rightmost green bar is missing because the Rosetta Code implementation was not compatible with starmap(). The y-axis is on a log scale to accentuate differences between the smallest values.
The leven implementation is fastest regardless of the method. Although the starmap() method is generally faster than the list comprehension method, the advantage is very small when both methods use the leven implementation. We might ask whether the size of this advantage depends on the length of the word list.
In the right plot, I varied the length of the word list from 250 to 16000 words, using the leven implementation in all tests. The linear trends on log-log axes show that all three methods scale linearly in the number of string pairs (n(n-1)/2), as one might expect. Surprisingly, starmap() provides essentially no advantage over the list comprehension. However, both starmap() and the list comprehension are about 5 times faster than pdist() across all list lengths.
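To make the quadratic growth concrete, here are the pair counts behind those four list lengths:
for n in (250, 1000, 4000, 16000):
    print(n, n * (n - 1) // 2)
# 250 31125
# 1000 499500
# 4000 7998000
# 16000 127992000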
Conclusion
The best way to compute all pairwise Levenshtein distances for a list of strings is to apply the leven package's distance function in a list comprehension over itertools.combinations(). The choice of distance function implementation is the most impactful factor: note that this top-rated answer recommends the Rosetta Code implementation, which is nearly 100x slower than leven. The process-based parallelization of starmap() appears to confer little to no advantage, although this may depend on the system.
What about scikit-learn pairwise_distances()?
As a final note, I have seen several askers and commenters propose sklearn.metrics.pairwise_distances() or paired_distances(), but I've had no luck with these. As far as I can tell, these functions require float data; attempting to use them with string or char inputs raises ValueError: could not convert string to float.
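For the curious, this is the kind of call that fails for me; the exact behavior may depend on your scikit-learn version:
import numpy as np
from sklearn.metrics import pairwise_distances
from leven import levenshtein
words = np.array(["parded", "deputed", "shopbook"]).reshape(-1, 1)
# Raises ValueError: could not convert string to float, because (as far
# as I can tell) pairwise_distances() coerces its input to a float array
# even when the metric is a custom callable
pairwise_distances(words, metric=lambda x, y: levenshtein(x[0], y[0]))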
Code
# Imports
from urllib.request import urlopen
from random import sample
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.spatial.distance import pdist, squareform
from time import time
from multiprocessing import Pool, cpu_count
from itertools import combinations
# Data
url = "https://raw.githubusercontent.com/dwyl/english-words/master/words_alpha.txt"
all_words = urlopen(url).read().splitlines()
# Implementations:
import leven
import editdistance
import pylev
import Levenshtein
# From https://rosettacode.org/wiki/Levenshtein_distance#Python:
def levenshteinDistance(str1, str2):
    m = len(str1)
    n = len(str2)
    d = [[i] for i in range(1, m + 1)]  # d matrix rows
    d.insert(0, list(range(0, n + 1)))  # d matrix columns
    for j in range(1, n + 1):
        for i in range(1, m + 1):
            if str1[i - 1] == str2[j - 1]:  # Python (string) is 0-based
                substitutionCost = 0
            else:
                substitutionCost = 1
            d[i].insert(
                j,
                min(
                    d[i - 1][j] + 1,
                    d[i][j - 1] + 1,
                    d[i - 1][j - 1] + substitutionCost,
                ),
            )
    return d[-1][-1]
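# Quick sanity check (my addition, not part of the original benchmark):
# the Rosetta implementation agrees with the TL;DR example above
assert levenshteinDistance("parded", "deputed") == 5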
lev_implementations = [
    leven.levenshtein,
    editdistance.eval,
    pylev.wfi_levenshtein,
    Levenshtein.distance,
    levenshteinDistance,
]
lev_impl_names = {
    "levenshtein": "leven",
    "eval": "editdistance",
    "wfi_levenshtein": "pylev",
    "distance": "Levenshtein",
    "levenshteinDistance": "Rosetta",
}
# Methods of computing pairwise distances
def pdist_(strings, levenshtein):
    # pdist() needs 2-D input, so wrap each string in its own row
    # and unwrap it again inside the metric callable
    transformed_strings = np.array(strings).reshape(-1, 1)
    return pdist(transformed_strings, lambda x, y: levenshtein(x[0], y[0]))

def list_comp(strings, levenshtein):
    return [levenshtein(i, j) for (i, j) in combinations(strings, 2)]

def starmap(strings, levenshtein):
    # uses the global `pool` created below
    return pool.starmap(levenshtein, combinations(strings, 2))

methods = [pdist_, list_comp, starmap]
# Figure 1
# Five simulations of each method x implementation pair, with 500 words
# NB: if run as a script rather than interactively, create the Pool under
# an `if __name__ == "__main__":` guard on platforms that spawn workers
pool = Pool(processes=cpu_count())
N_sims = 5
N_words = 500
times = []
impls = []
meths = []
for simulations in range(N_sims):
    strings = [x.decode() for x in sample(all_words, N_words)]
    for method in methods:
        for levenshtein in lev_implementations:
            # Skip starmap() with the Rosetta implementation (incompatible)
            if (method == starmap) and (levenshtein == levenshteinDistance):
                continue
            t0 = time()
            distance_matrix = method(strings, levenshtein)
            t1 = time()
            times.append(t1 - t0)
            meths.append(method.__name__.rstrip("_"))
            impls.append(lev_impl_names[levenshtein.__name__])
df = pd.DataFrame({"Time (s)": times, "Implementation": impls, "Method": meths})
# Figure 2
# Create datasets of different sizes, 250 - 16000 words
word_counts = [250, 1000, 4000, 16000]
pool = Pool(processes=cpu_count())
N_sims = 1
times = []
meths = []
comps = []
ll = []
for simulations in range(N_sims):
    for N in word_counts:
        strings = [x.decode() for x in sample(all_words, N)]
        for method in methods:
            t0 = time()
            distance_matrix = method(strings, leven.levenshtein)
            t1 = time()
            times.append(t1 - t0)
            meths.append(method.__name__.rstrip("_"))
            comps.append(sum(1 for _ in combinations(strings, 2)))  # = N*(N-1)/2
            ll.append(N)
df2 = pd.DataFrame({"Time (s)": times, "Method": meths, "Number of string pairs": comps, "List length": ll})
fig, axes = plt.subplots(1, 2, figsize=(10.5, 4))
sns.barplot(x="Implementation", y="Time (s)", hue="Method", data=df, ax=axes[0])
axes[0].set_yscale('log')
axes[0].set_title('List length = %i words' % (N_words,))
sns.lineplot(x="List length", y="Time (s)", hue="Method", data=df2, marker='o', ax=axes[1])
axes[1].set_yscale('log')
axes[1].set_xscale('log')
axes[1].set_title('Implementation = leven\nList lengths = 250, 1000, 4000, 16000')