SequenceMatcher - finding the two most similar elements of two or more lists of data

Question

I am trying to compare a set of strings against an already defined set of strings. For example, you want to find the addressee of a letter whose text has been digitized via OCR.

There is an array of addresses whose elements are dictionaries. Each element is unique and contains an ID, Name, Street, ZIP code and City. This list will be 1000 entries long.

Since OCR-scanned text can be inaccurate, we need to find the best-matching candidate strings in the list that contains the addresses.

The text is 750 words long. We reduce the number of words with an appropriate filter function, which first splits on whitespace, strips any remaining whitespace from each element, deletes all words fewer than 5 characters long and removes duplicates; the resulting list is 200 words long.

Since each addressee has 4 strings (Name, Street, ZIP code and City) and the remaining letter is 200 words long, my comparison has to run 4 * 1000 * 200 = 800,000 times.

I have used Python with medium success. Matches have been found correctly. However, the algorithm takes a long time to process a lot of letters (up to 50 hrs per 1500 letters). List comprehensions have been applied. Is there a way to correctly (and not unnecessarily) implement multithreading? What if this application needs to run on a low-spec server? My 6-core CPU does not complain about such tasks; however, I do not know how much time it will take to process a lot of documents on a small AWS instance.
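
A minimal sketch of the filter described above. The name `filter_words` and the `min_length` parameter are made up for illustration; they are not taken from my real code.

```python
def filter_words(text, min_length=5):
    """Split on whitespace, drop short words and remove duplicates (order preserved)."""
    seen = set()
    words = []
    # str.split() without arguments splits on any run of whitespace,
    # so no separate strip step is needed per element.
    for word in text.split():
        if len(word) < min_length:
            continue
        if word not in seen:
            seen.add(word)
            words.append(word)
    return words
```

For example, `filter_words("Insurance  Tax Insurance St0ckhlm")` keeps only `"Insurance"` and `"St0ckhlm"`.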

>> len(addressees)
1000
>> addressees[0]
{"Name": "John Doe", "Zip": 12345, "Street": "Boulevard of broken dreams 2", "City": "Stockholm"}
>> letter[:6] # already filtered
["Insurance", "Taxation", "Identification", "1592212", "St0ckhlm", "Mozart"]
>> from difflib import SequenceMatcher
>> def get_similarity_per_element(addressee, letter):
    """compare the similarity of each word in the letter with each field of one addressee"""
    ratios = []
    for l in letter:
        for a in addressee.values():
            # str() because fields such as Zip may be ints; ratios are stored as ints for faster arithmetic
            ratios.append(int(100 * SequenceMatcher(None, str(a), l).ratio()))
    return max(ratios)
>> get_similarity_per_element(addressees[0], letter[:6]) # percentage of the best-matching word in the letter against any field of the addressee
82
>> # then use this function to find all addressees with the maximum matching ratio
>> # if exactly one is greater than the others -> Done
>> # if more than one, but fewer than 3, are equal -> Interactive prompt -> Done
>> # else -> mark as not sortable -> Done.
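
The decision steps in the comments above could look roughly like this. It is only a sketch: `best_ratio`, `classify` and the status strings are hypothetical names, and it assumes a non-empty letter and a non-empty addressee list.

```python
from difflib import SequenceMatcher

def best_ratio(addressee, letter):
    """Highest similarity (0-100) between any letter word and any addressee field."""
    return max(
        int(100 * SequenceMatcher(None, str(field), word).ratio())
        for field in addressee.values()
        for word in letter
    )

def classify(addressees, letter):
    """Return (status, winners): one winner, a tie to ask about, or nothing."""
    scores = [best_ratio(a, letter) for a in addressees]
    top = max(scores)
    winners = [a for a, s in zip(addressees, scores) if s == top]
    if len(winners) == 1:
        return "match", winners        # exactly one best candidate
    if len(winners) < 3:
        return "ask", winners          # two tied candidates: interactive prompt
    return "unsortable", []            # three or more tied candidates
```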

I expected faster processing for each document (1 minute max), not 50 hrs per 1500 letters. I am sure this is the bottleneck, since the other tasks work fast and flawlessly.

Is there a better (faster) way to do this?

The docs mention that SequenceMatcher can be quadratic, which is very slow. Why did you choose `SequenceMatcher`? — Dani Mesejo, Jan 03 '19 at 18:01
It appeared to me as an uncomplicated tool to get what I wanted (a percentage of similarity between two elements) — valerius21, Jan 03 '19 at 18:06
Not only do the docs say it can be quadratic time, you also give it no help by specifying what is junk, and you use a double for loop, which in and of itself is `O(n^2)` — gold_cy, Jan 03 '19 at 18:10
You are comparing words in the letter with address items that can have multiple words, so things like the street will always have low scores, which can lead to wrong results. Also, why is the letter showing Tax if you had removed all words less than 5 characters long? — juvian, Jan 03 '19 at 18:10
@juvian I made the letter list up. It's corrected now. Thanks for pointing this out. Also, the street and name have been single words too, none of which is less than 5 letters long. I left this out of the example for the sake of simplicity — valerius21, Jan 03 '19 at 18:14
Multithreading won't help for a compute-bound task like this one. — martineau, Jan 03 '19 at 18:26

score 2 · Answer 1 · answered Jan 03 '19 at 18:35

A few quick tips:

1) Let me know how long it takes to use quick_ratio() or real_quick_ratio() instead of ratio()

2) Invert the order of the loops and use set_seq2 and set_seq1 so that SequenceMatcher reuses information

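from difflib import SequenceMatcher

ratios = []
for a in addressee.values():
    s = SequenceMatcher()
    s.set_seq2(str(a))  # SequenceMatcher caches detailed information about the second sequence
    for l in letter:
        s.set_seq1(l)
        ratios.append(int(100 * s.ratio()))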
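
Both tips can be combined. Since quick_ratio() and real_quick_ratio() are documented as upper bounds on ratio(), they can serve as cheap pre-filters so the expensive ratio() only runs on promising pairs. A rough sketch follows; the function name `best_ratio_pruned` and the `cutoff` default of 0.6 are my own assumptions, not something from the question.

```python
from difflib import SequenceMatcher

def best_ratio_pruned(fields, letter, cutoff=0.6):
    """Best ratio() over all (field, word) pairs, or 0.0 if none reaches cutoff."""
    best = 0.0
    threshold = cutoff
    for field in fields:
        s = SequenceMatcher()
        s.set_seq2(str(field))          # cached once per field
        for word in letter:
            s.set_seq1(word)
            # Cheap upper bounds first: skip pairs that cannot reach the threshold.
            if s.real_quick_ratio() < threshold or s.quick_ratio() < threshold:
                continue
            r = s.ratio()
            if r >= threshold:
                best = threshold = r    # raise the bar as better matches appear
    return best
```

Called with `addressee.values()` and the filtered letter, it returns a float between 0 and 1; multiply by 100 and truncate if you want the same ints as before.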

But a better solution would be something like @J_H describes

J_H · Answer 2 · 2019-01-03T18:43:33.903

You want to recognize inputs that are similar to dictionary words, e.g. "St0ckholm" -> "Stockholm". Transposition typos should be handled. Ok.

Possibly you would prefer to set autojunk=False. But a quadratic or cubic algorithm sounds like trouble if you're in a hurry.

Consider the Anagram Problem, where you're asked if an input word and a dictionary word are anagrams of one another. The straightforward solution is to compare the sorted strings for equality. Let's see if we can adapt that idea into a suitable data structure for your problem.

Pre-process your dictionary words into canonical keys that are easily looked up, and hang a list of one or more words off of each key. Use sorting to form the key. So for example we would have:

    'dgo' -> ['dog', 'god']

Store this map sorted by key.

Given an input word, you want to know if exactly that word appears in the dictionary, or if a version with limited edit distance appears in the dictionary. Sort the input word and probe the map for 1st entry greater or equal to that. Retrieve the (very short) list of candidate words and evaluate the distance between each of them and your input word. Output the best match. This happens very quickly.
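
Here is a rough sketch of that lookup, using `bisect` for the probe and `SequenceMatcher` as a stand-in distance measure. The names `build_index` and `lookup` are invented for illustration, and lowercasing the words before sorting is an assumption on my part.

```python
from bisect import bisect_left
from collections import defaultdict
from difflib import SequenceMatcher

def build_index(words):
    """Map each sorted-letter key to its words; also return the keys in sorted order."""
    by_key = defaultdict(list)
    for w in words:
        by_key["".join(sorted(w.lower()))].append(w)
    return sorted(by_key), by_key

def lookup(word, keys, by_key):
    """Probe the first key >= the sorted input; return (best candidate, ratio) or None."""
    i = bisect_left(keys, "".join(sorted(word.lower())))
    if i == len(keys):
        return None                      # probe sorts after every key
    scored = [(SequenceMatcher(None, c.lower(), word.lower()).ratio(), c)
              for c in by_key[keys[i]]]
    ratio, best = max(scored)
    return best, ratio
```

With `keys, by_key = build_index(["dog", "god", "Stockholm"])`, `by_key["dgo"]` holds `["dog", "god"]`. The caller still has to decide whether the returned ratio is good enough, because the probe always returns whatever sits at that position in the map.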

For fuzzier matching, use both the 1st and 2nd entries >= target, plus the preceding entry, so you have a larger candidate set. Also, so far this approach is sensitive to deletion of "small" letters like "a" or "b", due to ascending sorting. So additionally form keys with descending sort, and probe the map for both types of key.
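
A sketch of the wider probe, under the same assumptions as before (hypothetical names, lowercased keys, `SequenceMatcher` as the distance measure). It builds one index per sort direction and pools the preceding entry plus the first two entries >= the target from each.

```python
from bisect import bisect_left
from difflib import SequenceMatcher

def make_key(word, reverse=False):
    return "".join(sorted(word.lower(), reverse=reverse))

def build_index(words, reverse=False):
    """Return (sorted keys, key -> words) for ascending or descending letter order."""
    by_key = {}
    for w in words:
        by_key.setdefault(make_key(w, reverse), []).append(w)
    return sorted(by_key), by_key

def candidates(word, index, reverse=False):
    """Words at the preceding entry and at the first two entries >= the probe key."""
    keys, by_key = index
    i = bisect_left(keys, make_key(word, reverse))
    pool = set()
    for k in keys[max(i - 1, 0): i + 2]:
        pool.update(by_key[k])
    return pool

def best_match(word, asc_index, desc_index):
    """Closest word from both candidate sets, or None if the indexes are empty."""
    pool = candidates(word, asc_index) | candidates(word, desc_index, reverse=True)
    return max(pool,
               key=lambda c: SequenceMatcher(None, c.lower(), word.lower()).ratio(),
               default=None)
```

Build both indexes once with `build_index(words)` and `build_index(words, reverse=True)`, then call `best_match` per input word. The candidate pool stays at a handful of words per probe, so the expensive comparison runs only a few times per input.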

If you're willing to pip install packages, consider import soundex, which deliberately discards information from words, or import fuzzywuzzy.
