I've done some research on different Python libraries and on algorithms used to measure text distance/similarity: Levenshtein distance, Jaro-Winkler, Hamming, etc. None of them seems to work well for what I'm trying to do. Are there any other algorithms, or improvements on existing ones, that could be more useful in the following scenario?
I work in historical linguistics and I'm trying to improve the reliability of the POS and lemma tags of some of the corpora we are developing. A 'lemma' is the dictionary form or representative word for a class of words (e.g. 'do' would be the lemma for 'does', 'did', 'do' and 'done').
These corpora pose enormous challenges for automatic tagging because of the great variability in the syntax (word order) and in the spellings used in the texts over the centuries. The texts in the corpora span more than five centuries, come from different geographical areas, and are thus written in different dialects of the languages.
One of the things I'm trying to do is to automate the lemmatization. To give you an idea of the problem, let's take the example of 'vermell' (the word that means 'red' in Catalan, one of the languages we work with). I'm inventing some of the cases because I don't remember all of the possible variants by heart, but you can get an approximate idea of the magnitude of the problem. 'vermell' and its variants can appear in the texts with a wide range of forms:
singular:
masc: 'vermell', 'vermel', 'uermell', 'uermel', 'varmell', 'uarmell'
fem: 'vermella', 'vermela', 'uermella', 'varmella'
plural:
masc: 'vermells', 'uermells', 'vermels'
fem: 'vermelles', 'vermellas', 'uermellas', 'uermelles'
with different suffixes (superlatives, diminutives, etc.):
'vermellíssims', 'uermellíssima', 'vermelíssima', 'varmellisima', 'uarmelíssim', 'varmellota', 'vermellós', 'vermelloses', 'uarmellós'
Using stemmers is helpful in many of these cases, but they don't help as much as when working with standardized languages. In these corpora a 'u' is often used instead of a 'v', or a 'y' instead of an 'i'. An 'h' can also often be missing from a word that is spelt with 'h' elsewhere, even in the same text by the same author.
The variation is thus huge, and using stemmers or other traditional NLP techniques shows its limitations. And yet a contemporary speaker of the language can usually detect quite easily whether the words are related. Of course, the speaker of the language is knowledgeable about the word structure and the morphology and so can immediately see that, for instance, 'uermellíssima' is related to 'vermell' despite the fact that a lot of the characters are different.
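Just to make that last point concrete, below is a minimal sketch (standard library only) of the kind of orthographic normalization I have in mind before any distance is computed. The substitutions are only the ones I mentioned above (u/v, y/i, dropping 'h'), so it is an illustration rather than a real normalizer for these corpora:

# Very crude orthographic normalization; the substitution table is only
# illustrative and would need to be extended for real corpus data.
SUBSTITUTIONS = str.maketrans({'u': 'v', 'y': 'i'})

def normalize(form):
    form = form.lower()
    form = form.replace('h', '')          # 'h' is often missing, so drop it everywhere
    return form.translate(SUBSTITUTIONS)  # collapse u/v and y/i

# normalize('uermellíssima') == normalize('vermellíssima')  ->  True

This is deliberately crude and will also rewrite forms where the 'u' is a genuine 'u' (e.g. 'grogues'), but as long as both the forms and the dictionary go through the same normalization before being compared, that over-application seems mostly harmless.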
In another question (What is the most efficient way to identify text similarity between items in large lists of strings in Python?) I asked for help to improve the efficiency of a script that measures string similarities between two large lists of strings. With the help of Stack Overflow contributors I put together the script that I reproduce at the end of this message. It uses an implementation of a Hamming-like distance and takes around 2 hours to process a list of 180,000 unique forms against a dictionary of 64,000 forms. I'm using a Mac Studio with 10 cores.
The results are still rather unsatisfactory, though. You can see this with the following example:
beato ➝ beat: 0.8
beatriç ➝ tectriu: 0.5714285714285714
beatriç ➝ teatral: 0.5714285714285714
beatte ➝ beats: 0.6666666666666667
beatus ➝ certus: 0.6666666666666667
beatíssim ➝ nequíssim: 0.6666666666666667
beatíssim ➝ gravíssim: 0.6666666666666667
Even if you don't know the language (medieval Catalan, in case anybody is interested), you can see how this is very wrong (using other algorithms like the Levenshtein or the cosine distance it is just hopeless). The lemmas 'beat' or 'beats' should ideally be the ones selected as being the "closest" in all these cases, except for 'beatriç', which is a proper noun that is not in the dictionary. Yet the algorithm does what it does; it was not designed for this.
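Part of the problem, as far as I can tell, is that my measure compares characters strictly position by position, so a single insertion or a length difference penalizes everything that follows. Here is a small diagnostic sketch contrasting my measure with the alignment-based ratio() from the standard-library difflib (just for diagnosis, not a proposed solution):

from itertools import zip_longest
from difflib import SequenceMatcher

def positional_similarity(a, b):
    # The same measure as in my script below: the fraction of positions where
    # the characters coincide, with the shorter string padded by zip_longest.
    pairs = list(zip_longest(a, b))
    return sum(c1 == c2 for c1, c2 in pairs) / len(pairs)

for form, lemma in [('beatriç', 'beat'), ('beatriç', 'tectriu'), ('beato', 'beat')]:
    pos = positional_similarity(form, lemma)
    seq = SequenceMatcher(None, form, lemma).ratio()
    print(f"{form} ➝ {lemma}: positional={pos:.3f}  sequence={seq:.3f}")

If I understand ratio() correctly, it rewards characters that match in order anywhere in the two strings rather than only at the same index, so for 'beatriç' it prefers 'beat' (the shared prefix) over 'tectriu', which is closer to what a human reader would choose; whether an approach like that can scale to 180,000 forms against 64,000 lemmas is part of what I'm asking.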
So, my question is the following. Does anybody know of a better solution to achieve my goal? Are there any algorithms or tools that I can leverage to do this better and more efficiently? Perhaps I haven't looked hard enough, but with all the work in NLP, I'm surprised there aren't other algorithms that could do better in this kind of scenario.
I'm a linguist, not a trained programmer, so it would be helpful if your answer involved using part of my original script so that I can analyze it better and understand what each of the new parts is doing. Thanks so much in advance.
from itertools import zip_longest
from bisect import insort
from joblib import Parallel, delayed
import line_profiler

profile = line_profiler.LineProfiler()

# Small test data: the dictionary of lemmas and the list of attested forms.
lemmas = ['gran', 'vermell', 'groc', 'atens', 'Do', 'dOne', 'PUrpose', 'can', 'be', 'use', 'for', 'cannon', 'amuse', 'useful', 'user', 'become', 'downtown', 'develop', 'fulminate', 'deduce', 'de', 'bezant']
forms = ['preriarenos', 'Marinara', 'Grand', 'Gran', 'Grans', 'Grands', 'Grandeses', 'Grandullons', 'grand', 'grandissisimus', 'gran', 'grans', 'grands', 'grandeses', 'grandullons', 'grandullon', 'grandullones', 'uermell', 'uermells', 'vermell', 'vermells', 'vermella', 'vermelles', 'varmellíssimes', 'uarmellíssimes', 'uermellíssimes', 'uarnellíssimes', 'varmellíssima', 'uermella', 'uarmella', 'uarnella', 'varnella', 'uarnellas', 'varnellas', 'varmella', 'uermelles', 'grog', 'grogues', 'doNE', 'donE', 'doIng', 'purposeful', 'canonical', 'becareful', 'being', 'berate', 'best', 'bezant', 'full', 'fulmination', 'predict', 'downgrade', 'down', 'developing', 'deduct', 'deducing']

distances = {}

@delayed
def calc_distances(form, lemmas_low):
    # Compare one form against every lemma and keep the candidates above the threshold.
    form_distances = []
    for lemma in lemmas_low:
        # Position-by-position comparison (Hamming-like); zip_longest pads the
        # shorter string, so extra characters count as mismatches.
        char_mismatches = [c1 != c2 for c1, c2 in zip_longest(lemma, form)]
        # 'dist' is really a similarity: the fraction of positions that match.
        dist = 1 - (sum(char_mismatches) / len(char_mismatches))
        if dist > 0.25:
            # insort keeps the list sorted by similarity, best candidates last.
            insort(form_distances, (dist, lemma))
    return (form, form_distances)

@profile
def profile_distance_calcs():
    lemmas_low = [lemma.lower() for lemma in lemmas]
    forms_low = [form.lower() for form in forms]
    results = Parallel(n_jobs=-1, prefer="threads")(calc_distances(form, lemmas_low) for form in forms_low)
    for form, form_distances in results:
        distances[form] = form_distances
    # Write the two best candidates for each form, best first.
    with open("potential_lemmas_hamming-like.txt", "w") as f:
        for form, form_distances in distances.items():
            for dist, lemma in reversed(form_distances[-2:]):
                f.write(f"{form} ➝ {lemma}: {dist}\n")

if __name__ == "__main__":
    profile_distance_calcs()
    profile.print_stats()