Check if one string includes a substring with Levenshtein distance of 1 from other string

Question

We want our users to enter a code such as 639195-EM-66-XA-53-WX somewhere in their input, so the input may look like this: The code is 639195-EM-66-XA-53-WX, let me in. We still want to match the string if they make one small error in the code (Levenshtein distance of 1). For example: The code is 739195-EM-66-XA-53-WX, let me in. (the first character of the code changed from 6 to 7)

The algorithm should also match if the user skips the dashes, and it should be case-insensitive. These requirements are easy to fulfil, because I can remove all dashes and apply to_uppercase.

Is there an algorithm for something like that?

Generating all strings with the distance of 1 from original code is computationally expensive.

I also thought about a variant of Levenshtein distance that ignores the extra letters the user typed around the code, but that would also allow wrong letters in the middle of the code.

Searching for the code in user input seems a little bit better, but still not very clean.

So you know the letters or words before and after the code, i.e. the sentence before the code was entered? Max Levenshtein distance of 1 between two strings is easy to implement efficiently O(n). — maraca, Dec 05 '17 at 23:15
No, I don’t know the letters before and after. I will make it clear in the question. — tkowal, Dec 05 '17 at 23:30
This is similar to the problem of finding the [Levenshtein distance to all substrings](https://stackoverflow.com/questions/8139958/algorithm-to-find-edit-distance-to-all-substrings) of a string. — Anderson Green, Apr 13 '22 at 20:08

score 4 · Accepted Answer · answered Dec 06 '17 at 09:36

I had an idea for a solution, maybe this is good enough for you:

As you said, first remove the dashes and make everything upper (or lower) case:

Sentence: THE CODE IS 639195EM66XA53WX, LET ME IN

Code: 639195EM66XA53WX
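A minimal Python sketch of this normalization step, assuming dashes are the only separator that needs stripping (the helper name normalize is made up for illustration):

```python
def normalize(text: str) -> str:
    """Drop all dashes and convert the rest to upper case."""
    return text.replace("-", "").upper()


sentence = normalize("The code is 639195-em-66-XA-53-wx, let me in")
code = normalize("639195-EM-66-XA-53-WX")
print(sentence)  # THE CODE IS 639195EM66XA53WX, LET ME IN
print(code)      # 639195EM66XA53WX
```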

Split the code in the middle into c1 and c2. A Levenshtein distance of 1 means there is at most one mistake (insertion, deletion or replacement of a single character), so if the code is present in the sentence with at most one mistake, at least one of c1 and c2 must appear in it unchanged. Split in the middle because the longer both halves are, the fewer candidate matches you should get:

c1: 639195EM

c2: 66XA53WX

Now try to find c1 and c2 in your sentence. If you find a match, go forward (c1 matched) or backwards (c2 matched) in the sentence to check whether the missing part is within a Levenshtein distance of 1.
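As a rough Python sketch, the split and the search could look like this. The names split_code and find_all are hypothetical helpers, and the sentence is assumed to be normalized already (no dashes, upper case):

```python
def split_code(code: str) -> tuple[str, str]:
    """Split the code into two halves c1 and c2."""
    mid = len(code) // 2
    return code[:mid], code[mid:]


def find_all(haystack: str, needle: str) -> list[int]:
    """Return the start index of every occurrence, overlapping ones included."""
    positions = []
    start = haystack.find(needle)
    while start != -1:
        positions.append(start)
        start = haystack.find(needle, start + 1)
    return positions


sentence = "THE CODE IS 739195EM66XA53WX, LET ME IN"  # first character mistyped
c1, c2 = split_code("639195EM66XA53WX")
print(c1, c2)                  # 639195EM 66XA53WX
print(find_all(sentence, c1))  # []
print(find_all(sentence, c2))  # [20]
```

Every position returned for c1 or c2 is only a candidate; the other half still has to be checked next to it.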

So in your example you would find c2 and then:

Set one pointer to the last character of c1 and one to the character just before the match in the sentence.
While the characters are the same, move both pointers back by 1 (go backwards in both strings).
If you can consume c1 completely this way, you found an exact match (Levenshtein distance of 0).
Otherwise try the 3 possibilities for a Levenshtein distance of 1:
1. Move only the c1 pointer backwards and see if the rest matches (deletion).
2. Move only the sentence pointer backwards and see if the rest matches (insertion).
3. Move both pointers backwards and see if the rest matches (replacement).
If one of them succeeds, you found a match with a Levenshtein distance of 1; otherwise the distance is higher.
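The pointer walk above can be sketched in Python as follows. This is one possible reading of the steps, not a reference implementation: all function names are invented, both inputs are assumed to be normalized already, and the forward check (for the case where c1 matched) simply reuses the backward check on reversed strings, which is a shortcut of this sketch rather than something the steps prescribe.

```python
def matches_backward(text: str, end: int, part: str) -> bool:
    """Check whether `part`, with at most one edit, ends right before text[end]."""
    i = len(part) - 1  # pointer into the code half
    j = end - 1        # pointer into the sentence
    # Walk backwards while both strings agree.
    while i >= 0 and j >= 0 and part[i] == text[j]:
        i -= 1
        j -= 1
    if i < 0:
        return True  # consumed the whole half: distance 0

    def rest_matches(pi: int, tj: int) -> bool:
        # True if part[:pi + 1] appears in text ending exactly at index tj.
        if pi < 0:
            return True  # nothing left to compare
        return tj - pi >= 0 and text[tj - pi:tj + 1] == part[:pi + 1]

    return (
        rest_matches(i - 1, j)         # deletion: user left out part[i]
        or rest_matches(i, j - 1)      # insertion: user typed an extra text[j]
        or rest_matches(i - 1, j - 1)  # replacement: user typed text[j] for part[i]
    )


def matches_forward(text: str, start: int, part: str) -> bool:
    """Same check going forward from text[start], done on reversed strings."""
    # One extra character is enough to cover the insertion case.
    window = text[start:start + len(part) + 1]
    return matches_backward(window[::-1], len(window), part[::-1])


def contains_code(text: str, code: str) -> bool:
    """True if `text` contains `code` with a Levenshtein distance of at most 1."""
    mid = len(code) // 2
    c1, c2 = code[:mid], code[mid:]
    pos = text.find(c1)
    while pos != -1:  # c1 matched exactly: check c2 going forward
        if matches_forward(text, pos + len(c1), c2):
            return True
        pos = text.find(c1, pos + 1)
    pos = text.find(c2)
    while pos != -1:  # c2 matched exactly: check c1 going backwards
        if matches_backward(text, pos, c1):
            return True
        pos = text.find(c2, pos + 1)
    return False


code = "639195EM66XA53WX"
print(contains_code("THE CODE IS 739195EM66XA53WX, LET ME IN", code))  # True
print(contains_code("THE CODE IS 739X95EM66XA53WX, LET ME IN", code))  # False
```

The first call succeeds through the replacement case; the second has two mistakes in c1, so none of the three cases matches.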

The split is a compelling idea because it allows me to find the correct place in the user input. I was wondering if I can use Levenshtein distance directly on the second part of the code (to use out of the box solution), but it doesn't work for insertion. It looks like this algorithm is perfect for my case! — tkowal, Dec 06 '17 at 10:23
