Your problem is that you are comparing each name to every other name. That's O(n^2)
comparisons, which gets slow as the list grows. What you need to do is only compare pairs of names that have a chance of being similar enough.
To do better, we need to know what the library is actually doing. Thanks to this excellent answer we can tell that. It calls fuzz._process_and_sort(name, True)
on both names, then computes a Levenshtein ratio. That is, it finds a best way to get from one string to the other, and then calculates 100 * matched_chars / (matched_chars + edits)
. For this score to come out to 90+, the number of edits is at most len(name) / 9
. (That condition is necessary but not sufficient: if those edits include substitutions or deletions in this string, they also lower the number of matched characters and therefore the ratio.)
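For example, here is a minimal sketch of that idea. The normalize function below is my own approximation of the process-and-sort step (lowercase, strip non-alphanumerics, sort the tokens), and edit_budget is just the bound above; both names are mine, not the library's.

import re

def normalize(name):
    # Approximation of fuzzywuzzy's process-and-sort step:
    # lowercase, replace non-alphanumerics with spaces, sort the tokens.
    tokens = re.sub(r'[^a-z0-9]+', ' ', name.lower()).split()
    return ' '.join(sorted(tokens))

def edit_budget(normalized_name, min_score=90):
    # 100 * matched / (matched + edits) >= min_score and matched <= len(name)
    # imply edits <= len(name) * (100 - min_score) / min_score, i.e. len(name) / 9 for 90.
    return int(len(normalized_name) * (100 - min_score) / min_score)

print(normalize('Smith, John A.'))   # 'a john smith'
print(edit_budget('a john smith'))   # 1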
So you can normalize all of the names quite easily. The question is: for a given normalized name, can you find all the other normalized names within a maximum number of edits of it?
The trick is to first put all of your normalized names into a Trie data structure. Then we can walk the Trie in parallel with the query string, exploring only the branches that stay within a certain edit distance. This allows big groups of normalized names that are out of that distance to be dropped without examining them individually.
Here is a Python implementation of the Trie that will let you find those pairs of normalized names.
import re

# Now we will build a trie. Every node has a list of words, and a dictionary
# from the next letter farther in the trie.
class Trie:
    def __init__(self, path=''):
        self.strings = []
        self.dict = {}
        self.count_strings = 0
        self.path = path

    def add_string(self, string):
        trie = self
        for letter in string:
            trie.count_strings += 1
            if letter not in trie.dict:
                trie.dict[letter] = Trie(trie.path + letter)
            trie = trie.dict[letter]
        trie.count_strings += 1
        trie.strings.append(string)

    def __hash__(self):
        return id(self)

    def __repr__(self):
        answer = self.path + ":\n count_strings:" + str(self.count_strings) + "\n strings: " + str(self.strings) + "\n dict:"

        def indent(string):
            p = re.compile("^(?!:$)", re.M)
            return p.sub("    ", string)

        for letter in sorted(self.dict.keys()):
            subtrie = self.dict[letter]
            answer = answer + indent("\n" + subtrie.__repr__())
        return answer

    def within_edits(self, string, max_edits):
        # This will be all trie/string pos pairs that we have seen
        found = set()
        # This will be all trie/string pos pairs that we start the next edit with
        start_at_edit = set()

        # At distance 0 we start with the base of the trie matching the start of the string.
        start_at_edit.add((self, 0))
        answers = []
        for edits in range(max_edits + 1):  # 0..max_edits inclusive
            start_at_next_edit = set()
            todo = list(start_at_edit)
            for trie, pos in todo:
                if (trie, pos) not in found:  # Have we processed this?
                    found.add((trie, pos))
                    if pos == len(string):
                        answers.extend(trie.strings)  # ANSWERS FOUND HERE!!!
                        # We have to delete from the other string
                        for next_trie in trie.dict.values():
                            start_at_next_edit.add((next_trie, pos))
                    else:
                        # This string could have an insertion
                        start_at_next_edit.add((trie, pos + 1))
                        for letter, next_trie in trie.dict.items():
                            # We could have had a deletion in this string
                            start_at_next_edit.add((next_trie, pos))
                            if letter == string[pos]:
                                todo.append((next_trie, pos + 1))  # we matched farther
                            else:
                                # Could have been a substitution
                                start_at_next_edit.add((next_trie, pos + 1))
            start_at_edit = start_at_next_edit
        return answers

# Sample usage
trie = Trie()
trie.add_string('foo')
trie.add_string('bar')
trie.add_string('baz')
print(trie.within_edits('ba', 1))  # ['bar', 'baz'] (order may vary)
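Putting it all together, the overall flow could look roughly like the sketch below. It reuses the Trie above plus the normalize and edit_budget helpers from the earlier sketch, the names list is purely illustrative, and the final fuzz.token_sort_ratio check (the public fuzzywuzzy call behind the process-and-sort comparison) is there because the edit bound is necessary but not sufficient.

from fuzzywuzzy import fuzz

# Illustrative input; in your case this is the full list of names.
names = ['Acme Corp.', 'ACME Corporation', 'Bar Industries', 'Baz Industries']

# Group the original names by their normalized form and build the trie once.
by_norm = {}
for name in names:
    by_norm.setdefault(normalize(name), []).append(name)

trie = Trie()
for norm in by_norm:
    trie.add_string(norm)

# For each normalized name, only score the candidates the trie returns,
# instead of scoring all n^2 pairs.
for norm, originals in by_norm.items():
    budget = edit_budget(norm)  # max edits compatible with a 90+ ratio
    for candidate in trie.within_edits(norm, budget):
        if candidate == norm:
            continue
        for name in originals:
            for other in by_norm[candidate]:
                # The edit bound is necessary but not sufficient, so confirm with the real score.
                if fuzz.token_sort_ratio(name, other) >= 90:
                    print(name, 'matches', other)

That way you only pay the full fuzzy-matching cost on the handful of candidates that the trie lets through.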