Is there any specific way to implement Damerau-Levenshtein distances so they perform better on integer pairs?

Question

Say I have a list of numbers composed of a sequential ID followed by two mod11 check digits, for example:

ls = [191, 272, 353, 434, 515, 604, 787, 868, 949, 1082, 1163, 1244, 1325, 1406, 
      1597, 1678, 1759, 1830, 1910, 2054, 2135, 2216, 2305, 2488, 2569, 2640, 2720, 
      2801, 2992, 3026, 3107, 3298, 3379, 3450, 3530, 3611, 3700, 3883, 3964, 4006, 
      4189, 4260, 4340, 4421, 4502, 4693, 4774, 4855, 4936, ...]

And I want to understand, for each of my keys, the smallest Damerau-Levenshtein distance achievable to another key with the same magnitude in the list. len(ls) == 10**9, so doing it in the most optimal manner is going to be crucial.

My current approach is the following:

from tqdm.notebook import trange, tqdm
from fastDamerauLevenshtein import damerauLevenshtein

###################################
# Insert code to generate ls here #
###################################
def calculate_dual_mod11(n):
    m11 = ((n//10**8) + (n%10**8//10**7)*2 + (n%10**7//10**6)*3 + (n%10**6//10**5)*4 + (n%10**5//10**4)*5 + \
         (n%10**4//10**3)*6 + (n%10**3//10**2)*7 + (n%10**2//10)*8 + (n%10)*9)%11%10
    n_backup = n*10 + m11
    n = n_backup%10**9
    
    m11 = ((n//10**8) + (n%10**8//10**7)*2 + (n%10**7//10**6)*3 + (n%10**6//10**5)*4 + (n%10**5//10**4)*5 + \
         (n%10**4//10**3)*6 + (n%10**3//10**2)*7 + (n%10**2//10)*8 + (n%10)*9)%11%10
    return (n*10 + m11)%100

ls = [i*100+calculate_dual_mod11(i) for i in range(10**9)]

# Instantiate a list of Nones to prevent those pesky O(n) inserts
# when CPython needs to move data. This way, we only have one O(n)
# operation and a bunch of O(1)s later down the road.
damerau_ls = [None]*len(ls) 

for i in trange(len(ls)):
    first_str = str(ls[i])

    # Except for the edge cases beginning in a sequence of 9s
    # where the next key is a character longer, the upper bound
    # of the minimum Damerau-Levenshtein distance is equal to the first string's size
    # I'm willing to compromise on those edge cases.
    damerau_ls[i] = len(first_str) 

    for j in range(i+1,len(ls)):
        other_str = str(ls[j])
        # Once we get to a string longer than the main, break the inner loop.
        if len(other_str) > len(first_str):
            break
        damerau_ls[i] = min(damerau_ls[i], damerauLevenshtein(first_str, other_str, similarity=False))
        # Damerau-Levenshtein distance between two distinct strings will never be less than 1.
        if damerau_ls[i] == 1:
            break

Is there any faster way to achieve this, especially given the constraint that I'm working with integers and that they have the same order of magnitude? This code runs pretty quickly for the first 1000 iterations, then it evidently starts getting considerably slower as we progress through the longer strings.

The specific Damerau-Levenshtein implementation I'm using is fastDamerauLevenshtein.

Hey, did you try the rust implementation? https://github.com/joleaf/pyrsdameraulevenshtein — Jonas Blatt, Mar 03 '23 at 08:09

Is there any specific way to implement Damerau-Levenshtein distances so they perform better on integer pairs?

0 Answers0