The following piece of code produces the results I'm trying to achieve. There is a list of strings called 'lemmas' that contains the accepted forms of a specific class of words. The other list, called 'forms', contains many spelling variations of words found in a large number of texts from different periods and different dialects of a specific language. For each one of the words in 'forms', I want to get the string in 'lemmas' that is the closest match.
The script, as I said, seems to work well with some test lists I've constructed. The problem, though, is that when I use the real lists, which are rather large, it takes forever to produce the results. In fact, I had to stop the execution of the program because it had already been running for more than two hours, the computer was becoming very slow, and I couldn't do anything else.
What could I do to make this more efficient? How would I have to modify the code using other Python tools or libraries to make this faster? Thanks in advance.
import textdistance
from textdistance import hamming
from textdistance import cosine
from textdistance import jaro_winkler
import heapq

# 'lemmas' is a list containing a huge amount of words, basically dictionary entries
# 'forms' is a huge list of spelling variations of words found in hundreds of texts

distances = {}
processed_pairs = set()  # keep track of processed pairs

for lemma in lemmas:
    if lemma is None:
        continue
    lemma_lower = lemma.lower()
    for form in forms:
        if form is None:
            continue
        form_lower = form.lower()
        pair = (lemma_lower, form_lower)  # create a tuple with the lowercase pair
        if pair not in processed_pairs:  # check if the pair has been processed before
            processed_pairs.add(pair)
            if textdistance.hamming.normalized_similarity(lemma_lower, form_lower) > 0.34 and textdistance.jaro_winkler(lemma_lower, form_lower) > 0.7 and textdistance.cosine(lemma_lower, form_lower) > 0.5:
                dist = hamming.normalized_similarity(lemma_lower, form_lower)
                distances.setdefault(form_lower, []).append((dist, lemma_lower))

# Find the closest pairs
closest_pairs = {}
for form, dist_lemmas in distances.items():
    closest_pairs[form] = heapq.nsmallest(2, dist_lemmas)

with open(ROOT / 'potential_lemmas.txt', 'w') as f:
    for form, pairs in closest_pairs.items():
        for dist, lemma in pairs:
            f.write(f"{form} ➝ {lemma}: {dist}\n")
EDIT:
In the end, the solution that worked best was an integration of @Kyle F Hartzenberg's proposal with @Jamie_B's suggestion of using joblib to parallelize (see the comments after the code, though):
from itertools import zip_longest
from bisect import insort
from joblib import Parallel, delayed
import line_profiler
profile = line_profiler.LineProfiler()
lemmas = ['gran', 'vermell', 'groc', 'atens', 'Do', 'dOne', 'PUrpose', 'can', 'be', 'use', 'for', 'cannon', 'amuse', 'useful', 'user', 'become', 'downtown', 'develop', 'fulminate', 'deduce', 'de', 'bezant']
forms = ['preriarenos', 'Marinara', 'Grand', 'Gran', 'Grans', 'Grands', 'Grandeses', 'Grandullons', 'grand', 'grandissisimus', 'gran', 'grans', 'grands', 'grandeses', 'grandullons', 'grandullon', 'grandullones', 'uermell', 'uermells', 'vermell', 'vermells', 'vermella', 'vermelles', 'varmellíssimes', 'uarmellíssimes', 'uermellíssimes', 'uarnellíssimes', 'varmellíssima', 'uermella', 'uarmella', 'uarnella', 'varnella', 'uarnellas', 'varnellas', 'varmella', 'uermelles', 'grog', 'grogues', 'doNE', 'donE', 'doIng', 'purposeful', 'canonical', 'becareful', 'being', 'berate', 'best', 'bezant', 'full', 'fulmination', 'predict', 'downgrade', 'down', 'developing', 'deduct', 'deducing']
distances = {}
@delayed
def calc_distances(form, lemmas_low):
    form_distances = []
    for lemma in lemmas_low:
        char_matches = [c1 != c2 for c1, c2 in zip_longest(lemma, form)]
        dist = 1 - (sum(char_matches) / len(char_matches))
        if dist > 0.25:
            insort(form_distances, (dist, lemma))
    return (form, form_distances)
@profile
def profile_distance_calcs():
    lemmas_low = [lemma.lower() for lemma in lemmas]
    forms_low = [form.lower() for form in forms]
    results = Parallel(n_jobs=-1, prefer="threads")(calc_distances(form, lemmas_low) for form in forms_low)
    for form, form_distances in results:
        distances[form] = form_distances
    with open("potential_lemmas_hamming-like.txt", "w") as f:
        for form, form_distances in distances.items():
            for dist, lemma in reversed(form_distances[-2:]):
                f.write(f"{form} ➝ {lemma}: {dist}\n")

if __name__ == "__main__":
    profile_distance_calcs()
    profile.print_stats()
This was a HUGE improvement over everything I had tried before. Besides the test with the short lists in the example, I ran it with the actual lists containing around 190,000 strings and the processing time was 118 minutes. For the time being this is quite manageable, although I'm pretty sure it could be improved further, for instance with some kind of vectorization (someone suggested using numpy arrays or AI-oriented libraries).
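In case it is useful to anyone, here is a minimal sketch of what that numpy vectorization could look like for the same Hamming-like similarity, processing the forms in chunks so the broadcasted comparison matrix stays within memory. The names `to_char_matrix`, `closest_lemmas`, `threshold`, `keep` and `chunk` are made up for the sketch and are not part of the script above; I have not benchmarked this against the joblib version.

import numpy as np

def to_char_matrix(words, width, pad="\0"):
    # one row per word, one column per character position, NUL-padded on the right
    return np.array([list(w.ljust(width, pad)) for w in words])

def closest_lemmas(forms, lemmas, threshold=0.25, keep=2, chunk=1000):
    width = max(map(len, forms + lemmas))
    L = to_char_matrix(lemmas, width)                      # (n_lemmas, width)
    lemma_len = np.array([len(l) for l in lemmas])
    results = {}
    for start in range(0, len(forms), chunk):
        batch = forms[start:start + chunk]
        F = to_char_matrix(batch, width)                   # (len(batch), width)
        # a padding position compared against a real character counts as a
        # mismatch, which mirrors what zip_longest does in the loop above
        mismatches = (F[:, None, :] != L[None, :, :]).sum(axis=2)
        pair_len = np.maximum(np.array([len(f) for f in batch])[:, None],
                              lemma_len[None, :])
        sim = 1.0 - mismatches / pair_len                  # Hamming-like similarity
        for i, form in enumerate(batch):
            best = np.argsort(sim[i])[::-1][:keep]         # indices of the `keep` highest scores
            results[form] = [(float(sim[i, j]), lemmas[j])
                             for j in best if sim[i, j] > threshold]
    return results

The chunk size only trades memory for speed; the per-pair scores should come out the same as in the loop version.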
There is still a problem, though, that doesn't have to do with efficiency. I mention this in my comment to @jqurious below, but I will explain it here in more detail. Running the script above with the test lists, one gets results like the following:
berate ➝ bezant: 0.5
berate ➝ become: 0.5
From a linguistic point of view, any English speaker would know that these pairs of words are not related (OK, unless you know about the history of the language and know that be- used to be a productive prefix). What I'm trying to do with this script is to determine what would be the appropriate lemma (the dictionary form or representative word) for all the variants of a particular word found in the texts of a corpus.
This is a diachronic corpus containing many texts from many different authors and from many different dialects of a language, written over a period of more than five centuries. A 'u' could often be used instead of a 'v', or a 'y' instead of an 'i'. An 'h' can also often be missing from a word that is otherwise spelt with 'h', even in the same text by the same author. The variation is huge, and yet even a modern speaker of the language can usually detect quite easily whether the words are related. Of course, the speaker of the language is knowledgeable about the word structure and the morphology and so can immediately see that, for instance, 'uermellíssima' is related to 'vermell' despite the fact that a lot of characters are different.
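One thing that might mitigate this is normalizing the known orthographic alternations before computing any distance. A minimal sketch of the idea follows; the substitution rules here are only illustrative and would have to be worked out properly for the corpus (they are exactly the u/v, y/i and missing-'h' alternations mentioned above, plus a doubled-letter collapse thrown in as an example):

import re

# Illustrative only: the real rules would need to be tailored to the corpus.
SUBS = str.maketrans({"u": "v", "y": "i"})

def normalize(word):
    w = word.lower().translate(SUBS)   # collapse u/v and y/i to one symbol
    w = w.replace("h", "")             # drop the unstable 'h'
    w = re.sub(r"(.)\1+", r"\1", w)    # collapse doubled letters (illustrative)
    return w

print(normalize("uermellíssima"))  # vermelísima
print(normalize("vermell"))        # vermel

After a step like this, 'uermellíssima' and 'vermell' share the same initial characters, so any distance measure has an easier job.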
Using Kyle's suggestion with the actual lists, I got results like the following:
beato ➝ beat: 0.8
beatriç ➝ tectriu: 0.5714285714285714
beatriç ➝ teatral: 0.5714285714285714
beatte ➝ beats: 0.6666666666666667
beatus ➝ certus: 0.6666666666666667
beatíssim ➝ nequíssim: 0.6666666666666667
beatíssim ➝ gravíssim: 0.6666666666666667
Even if you don't know the language (medieval Catalan, in case anybody is interested), you can see how this is very wrong (using other algorithms like Levenshtein or cosine distance, it is just hopeless). The lemmas 'beat' or 'beats' should ideally be the ones selected as the "closest" in all these cases. Yet the algorithm does what it does.
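For what it's worth, a prefix-weighted measure such as Jaro-Winkler (which the original script already imports from textdistance) should in principle favour shared-stem candidates like 'beat' over suffix-sharing ones like 'nequíssim', since it boosts the score when the two strings share a common prefix. A quick check along these lines, without vouching for the exact numbers:

import textdistance

# Jaro-Winkler boosts pairs that share a prefix, so 'beat' should score
# higher against 'beatíssim' than the suffix-sharing candidates do.
for lemma in ["beat", "beats", "nequíssim", "gravíssim"]:
    print(lemma, textdistance.jaro_winkler("beatíssim", lemma))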
Perhaps I haven't looked hard enough, but with all the work in NLP, I'm surprised there aren't other algorithms that could do better in this kind of scenario. I know this deviates a little bit from the main point in the original question but if anybody can give me some useful advice, I would greatly appreciate it.