The following piece of code produces the results I'm trying to achieve. There is a list of strings called 'lemmas' that contains the accepted forms of a specific class of words. The other list, called 'forms', contains many spelling variations of words found in a large number of texts from different periods and different dialects of a specific language. For each one of the words in 'forms', I want to get the string in 'lemmas' that is the closest match.
The script, as I said, seems to work well with some test lists I've constructed. The problem, though, is that when I use the real lists, which are rather large, it takes forever to produce the results. In fact, I had to stop the execution of the program because it had already been running for more than two hours, the computer was becoming very slow, and I couldn't do anything else.
What could I do to make this more efficient? How would I have to modify the code using other Python tools or libraries to make this faster? Thanks in advance.
import textdistance
from textdistance import hamming
from textdistance import cosine
from textdistance import jaro_winkler
import heapq

# 'lemmas' is a list containing a huge amount of words, basically dictionary entries
# 'forms' is a huge list of spelling variations of words found in hundreds of texts

distances = {}
processed_pairs = set()  # keep track of processed pairs

for lemma in lemmas:
    if lemma is None:
        continue
    lemma_lower = lemma.lower()
    for form in forms:
        if form is None:
            continue
        form_lower = form.lower()
        pair = (lemma_lower, form_lower)  # create a tuple with the lowercase pair
        if pair not in processed_pairs:  # check if the pair has been processed before
            processed_pairs.add(pair)
            if textdistance.hamming.normalized_similarity(lemma_lower, form_lower) > 0.34 and textdistance.jaro_winkler(lemma_lower, form_lower) > 0.7 and textdistance.cosine(lemma_lower, form_lower) > 0.5:
                dist = hamming.normalized_similarity(lemma_lower, form_lower)
                distances.setdefault(form_lower, []).append((dist, lemma_lower))

# Find the closest pairs
closest_pairs = {}
for form, dist_lemmas in distances.items():
    closest_pairs[form] = heapq.nsmallest(2, dist_lemmas)

with open(ROOT / 'potential_lemmas.txt', 'w') as f:
    for form, pairs in closest_pairs.items():
        for dist, lemma in pairs:
            f.write(f"{form} ➝ {lemma}: {dist}\n")
EDIT:
In the end, the solution that worked best was an integration of @Kyle F Hartzenberg's proposal with @Jamie_B's suggestion of using joblib to parallelize (see the comments after the code, though):
from itertools import zip_longest
from bisect import insort
from joblib import Parallel, delayed
import line_profiler
profile = line_profiler.LineProfiler()
lemmas = ['gran', 'vermell', 'groc', 'atens', 'Do', 'dOne', 'PUrpose', 'can', 'be', 'use', 'for', 'cannon', 'amuse', 'useful', 'user', 'become', 'downtown', 'develop', 'fulminate', 'deduce', 'de', 'bezant']
forms = ['preriarenos', 'Marinara', 'Grand', 'Gran', 'Grans', 'Grands', 'Grandeses', 'Grandullons', 'grand', 'grandissisimus', 'gran', 'grans', 'grands', 'grandeses', 'grandullons', 'grandullon', 'grandullones', 'uermell', 'uermells', 'vermell', 'vermells', 'vermella', 'vermelles', 'varmellíssimes', 'uarmellíssimes', 'uermellíssimes', 'uarnellíssimes', 'varmellíssima', 'uermella', 'uarmella', 'uarnella', 'varnella', 'uarnellas', 'varnellas', 'varmella', 'uermelles', 'grog', 'grogues', 'doNE', 'donE', 'doIng', 'purposeful', 'canonical', 'becareful', 'being', 'berate', 'best', 'bezant', 'full', 'fulmination', 'predict', 'downgrade', 'down', 'developing', 'deduct', 'deducing']
distances = {}
@delayed
def calc_distances(form, lemmas_low):
    form_distances = []
    for lemma in lemmas_low:
        char_matches = [c1 != c2 for c1, c2 in zip_longest(lemma, form)]
        dist = 1 - (sum(char_matches) / len(char_matches))
        if dist > 0.25:
            insort(form_distances, (dist, lemma))
    return (form, form_distances)
@profile
def profile_distance_calcs():
    lemmas_low = [lemma.lower() for lemma in lemmas]
    forms_low = [form.lower() for form in forms]
    results = Parallel(n_jobs=-1, prefer="threads")(calc_distances(form, lemmas_low) for form in forms_low)
    for form, form_distances in results:
        distances[form] = form_distances
    with open("potential_lemmas_hamming-like.txt", "w") as f:
        for form, form_distances in distances.items():
            for dist, lemma in reversed(form_distances[-2:]):
                f.write(f"{form} ➝ {lemma}: {dist}\n")

if __name__ == "__main__":
    profile_distance_calcs()
    profile.print_stats()
This was a HUGE improvement over everything I had tried before. Besides the test with the short lists in the example, I ran it with the actual lists containing around 190,000 strings and the processing time was 118 minutes. For the time being this is quite manageable, although I'm pretty sure it could be improved further, for instance with some kind of vectorization (someone suggested using numpy arrays or AI-oriented libraries).
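In case it is useful to anyone, here is a minimal sketch of what that numpy vectorization could look like for the same Hamming-like similarity, processing the forms in chunks so the broadcasted comparison matrix stays within memory. The names `to_char_matrix`, `closest_lemmas`, `threshold`, `keep` and `chunk` are made up for the sketch and are not part of the script above; I have not benchmarked this against the joblib version.

import numpy as np

def to_char_matrix(words, width, pad="\0"):
    # one row per word, one column per character position, NUL-padded on the right
    return np.array([list(w.ljust(width, pad)) for w in words])

def closest_lemmas(forms, lemmas, threshold=0.25, keep=2, chunk=1000):
    width = max(map(len, forms + lemmas))
    L = to_char_matrix(lemmas, width)                      # (n_lemmas, width)
    lemma_len = np.array([len(l) for l in lemmas])
    results = {}
    for start in range(0, len(forms), chunk):
        batch = forms[start:start + chunk]
        F = to_char_matrix(batch, width)                   # (len(batch), width)
        # a padding position compared against a real character counts as a
        # mismatch, which mirrors what zip_longest does in the loop above
        mismatches = (F[:, None, :] != L[None, :, :]).sum(axis=2)
        pair_len = np.maximum(np.array([len(f) for f in batch])[:, None],
                              lemma_len[None, :])
        sim = 1.0 - mismatches / pair_len                  # Hamming-like similarity
        for i, form in enumerate(batch):
            best = np.argsort(sim[i])[::-1][:keep]         # indices of the `keep` highest scores
            results[form] = [(float(sim[i, j]), lemmas[j])
                             for j in best if sim[i, j] > threshold]
    return results

The chunk size only trades memory for speed; the per-pair scores should come out the same as in the loop version.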
There is still a problem, though, that doesn't have to do with efficiency. I mention this in my comment to @jqurious below, but I will explain it here in more detail. Running the script above with the test lists, one gets results like the following:
berate ➝ bezant: 0.5
berate ➝ become: 0.5
From a linguistic point of view, any English speaker would know that these pairs of words are not related (OK, unless you know about the history of the language and know that be- used to be a productive prefix). What I'm trying to do with this script is to determine what would be the appropriate lemma (the dictionary form or representative word) for all the variants of a particular word found in the texts of a corpus.
This is a diachronic corpus containing many texts from many different authors and from many different dialects of a language, written over a period of more than five centuries. A 'u' could often be used instead of a 'v', or a 'y' instead of an 'i'. An 'h' can also often be missing from a word that is otherwise spelt with 'h', even in the same text by the same author. The variation is huge, and yet even a modern speaker of the language can usually detect quite easily whether the words are related. Of course, the speaker of the language is knowledgeable about the word structure and the morphology and so can immediately see that, for instance, 'uermellíssima' is related to 'vermell' despite the fact that a lot of characters are different.
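One thing that might mitigate this is normalizing the known orthographic alternations before computing any distance. A minimal sketch of the idea follows; the substitution rules here are only illustrative and would have to be worked out properly for the corpus (they are exactly the u/v, y/i and missing-'h' alternations mentioned above, plus a doubled-letter collapse thrown in as an example):

import re

# Illustrative only: the real rules would need to be tailored to the corpus.
SUBS = str.maketrans({"u": "v", "y": "i"})

def normalize(word):
    w = word.lower().translate(SUBS)   # collapse u/v and y/i to one symbol
    w = w.replace("h", "")             # drop the unstable 'h'
    w = re.sub(r"(.)\1+", r"\1", w)    # collapse doubled letters (illustrative)
    return w

print(normalize("uermellíssima"))  # vermelísima
print(normalize("vermell"))        # vermel

After a step like this, 'uermellíssima' and 'vermell' share the same initial characters, so any distance measure has an easier job.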
Using Kyle's suggestion with the actual lists, I got results like the following:
beato ➝ beat: 0.8
beatriç ➝ tectriu: 0.5714285714285714
beatriç ➝ teatral: 0.5714285714285714
beatte ➝ beats: 0.6666666666666667
beatus ➝ certus: 0.6666666666666667
beatíssim ➝ nequíssim: 0.6666666666666667
beatíssim ➝ gravíssim: 0.6666666666666667
Even if you don't know the language (medieval Catalan, in case anybody is interested), you can see how this is very wrong (using other algorithms like Levenshtein or cosine distance, it is just hopeless). The lemmas 'beat' or 'beats' should ideally be the ones selected as the "closest" in all these cases. Yet the algorithm does what it does.
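For what it's worth, a prefix-weighted measure such as Jaro-Winkler (which the original script already imports from textdistance) should in principle favour shared-stem candidates like 'beat' over suffix-sharing ones like 'nequíssim', since it boosts the score when the two strings share a common prefix. A quick check along these lines, without vouching for the exact numbers:

import textdistance

# Jaro-Winkler boosts pairs that share a prefix, so 'beat' should score
# higher against 'beatíssim' than the suffix-sharing candidates do.
for lemma in ["beat", "beats", "nequíssim", "gravíssim"]:
    print(lemma, textdistance.jaro_winkler("beatíssim", lemma))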
Perhaps I haven't looked hard enough, but with all the work in NLP, I'm surprised there aren't other algorithms that could do better in this kind of scenario. I know this deviates a little bit from the main point in the original question but if anybody can give me some useful advice, I would greatly appreciate it.