
I have a problem finding similar food names in my database (there are about 100k product names). I've decided to use fuzz.token_sort_ratio from the fuzzywuzzy library to find similar product names. This is how it works:

s1 = 'Pepsi Light'
s2 = 'Light Pepsi'
fuzz.token_sort_ratio(s1, s2)

100

Now I want to find all pairs of product names with similar words, i.e. those for which fuzz.token_sort_ratio >= 90. Here is my code:

# Find similar names (v_foods is the DataFrame holding the product names)
from datetime import datetime

import pandas as pd
from fuzzywuzzy import fuzz

start = datetime.now()
l = list(v_foods.name[0:20000])
i = 0
df = pd.DataFrame(columns=['name1', 'name2', 'probab_same'])
for k in range(len(l)):
    for s in range(k + 1, len(l)):
        probability = fuzz.token_sort_ratio(l[k], l[s])
        if probability >= 90:
            df.loc[i] = [l[k], l[s], probability]
            i += 1
print('Spent time: {}'.format(datetime.now() - start))
df.head(5)

It takes a lot of time, and the more products I have, the longer it takes:

  1. l = list(v_foods.name[0:5000]) Spent time: ~3 minutes
  2. l = list(v_foods.name[0:10000]) Spent time: ~13 minutes
  3. l = list(v_foods.name[0:20000]) Spent time: ~53 minutes
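
For scale, the number of unordered pairs grows quadratically with the list size, which matches the roughly 4x jump in runtime each time the list doubles (a quick back-of-the-envelope check, not part of the timings above):

# n names require n * (n - 1) / 2 pairwise comparisons
for n in (5000, 10000, 20000, 100000):
    print(n, n * (n - 1) // 2)
# 5000   ->  ~12.5 million pairs
# 10000  ->  ~50 million pairs   (4x)
# 20000  ->  ~200 million pairs  (4x again)
# 100000 ->  ~5 billion pairs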

As I said above, my database has 100k names, so this will be very slow. Are there any ways to optimize my algorithm?

    Could you run a profiler to see exactly what is taking so long? I suspect it's the fuzz.token_sort_ratio method; in which case you will need to find another quicker similarity method. Moreover, your post should be posted on code review not on stack overflow. – Mathieu Nov 06 '20 at 09:09
  • I'm afraid that, if the comparison function (`token_sort_ratio`) is a black box, there isn't anything more efficient than O(N²). Comparing 100K elements *will* take ~25h unless the program can be compiled more efficiently. – Mark Jeronimus Nov 06 '20 at 09:29
  • You can get something better but it's getting to more advanced computer science. It's called a Levenshtein automaton. I don't know of any open source Python implementations (but also didn't look). – orlp Nov 06 '20 at 10:03
  • @Mathieu As a relative newcomer to this forum, I'm curious about your comment "your post should be posted on code review not on stack overflow". My understanding is that this post belongs here on SO, and the Code Review forum is meant for asking a more open-ended question like "What improvements can you folks suggest for my code?". Here, OP is asking how to obtain a very specific improvement (speed). Is my understanding wrong? – fountainhead Nov 06 '20 at 10:32
  • @fountainhead SO is for more open questions, usually about code that isn't working. Code Review is meant for working code only, to optimize it further, usually to improve computation time and memory usage. Additionally, the overall "look" of the code/design and its clarity is generally improved simultaneously. – Mathieu Nov 06 '20 at 10:59
  • @Mathieu - Interesting. I didn't read up on the rules myself. But in these 2+ years of my existence on SO, I did find a good percentage (5 to 10) of questions being specifically on speed improvement (eg using `numpy`). And the askers and answerers seem to include stalwarts, veterans and long-timers. Must be one of those unfortunate gaps between rules and their actual compliance then? – fountainhead Nov 06 '20 at 11:10
  • @Mathieu My informal rule is that it depends if they want to improve their existing program, or write a new one that they don't know how to write. If they want improvement, I recommend code review. If someone wants a completely different algorithm, I am happy to answer. (As I did here.) – btilly Nov 06 '20 at 18:19
  • @btilly I agree and I believe the distinction between both isn't clear. I didn't read your answer; but on this post, the optimization seems more tied to a change in method than to a code optimization, and thus seems suited for SO too. – Mathieu Nov 06 '20 at 20:07

2 Answers


Your problem is that you are comparing each name to every other name. That's n^2 comparisons, which is why it gets slow. What you need to do is only compare pairs of names that have a chance of being similar enough.

To do better, we need to know what the library is actually doing. Thanks to this excellent answer we can tell: it calls fuzz._process_and_sort(name, True) on both names, then looks for a Levenshtein ratio. That is, it computes a best way to get from one string to the other, and then calculates 100 * matched_chars / (matched_chars + edits). For this score to come out to 90+, the number of edits is at most len(name) / 9. (That condition is necessary but not sufficient: if those edits include substitutions or deletions in this string, they also lower the number of matched characters and therefore the ratio.)
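
As a quick sanity check of that bound (a small illustrative snippet; the helper name is mine, not part of any library):

# 100 * matched / (matched + edits) >= 90  implies  edits <= matched / 9,
# and matched is at most len(name), so edits <= len(name) / 9.
def max_edits_for_ratio(name_len, min_ratio=90):
    # Largest edit count that can still reach min_ratio in the best case
    # where every remaining character matches.
    return name_len * (100 - min_ratio) // min_ratio

print(max_edits_for_ratio(11))  # 'light pepsi' has 11 characters -> 1 edit allowed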

So you can normalize all of the names quite easily. The question then becomes: for a given normalized name, can you find all the other normalized names within a maximum number of edits of it?

The trick to that is to first put all of your normalized names into a Trie data structure. And then we can walk the Trie in parallel to explore all branches that are within a certain edit distance. This allows big groups of normalized names that are out of that distance to be dropped without examining them individually.

Here is a Python implementation of the Trie that will let you find those pairs of normalized names.

import re

# Now we will build a trie.  Every node has a list of strings that end there,
# and a dictionary mapping each next letter to the subtrie one level deeper.
class Trie:
    def __init__(self, path=''):
        self.strings = []
        self.dict = {}
        self.count_strings = 0
        self.path = path

    def add_string (self, string):
        trie = self

        for letter in string:
            trie.count_strings += 1
            if letter not in trie.dict:
                trie.dict[letter] = Trie(trie.path + letter)
            trie = trie.dict[letter]
        trie.count_strings += 1
        trie.strings.append(string)

    def __hash__ (self):
        return id(self)

    def __repr__ (self):
        answer = self.path + ":\n  count_strings:" + str(self.count_strings) + "\n  strings: " + str(self.strings) + "\n  dict:"
        def indent (string):
            p = re.compile("^(?!:$)", re.M)
            return p.sub("    ", string)
        for letter in sorted(self.dict.keys()):
            subtrie = self.dict[letter]
            answer = answer + indent("\n" + subtrie.__repr__())
        return answer

    def within_edits(self, string, max_edits):
        # This will be all trie/string pos pairs that we have seen
        found = set()
        # This will be all trie/string pos pairs that we start the next edit with
        start_at_edit = set()

        # At distance 0 we start where the base of the trie lines up with the start of the string.
        start_at_edit.add((self, 0))
        answers = []
        for edits in range(max_edits + 1): # 0..max_edits inclusive
            start_at_next_edit = set()
            todo = list(start_at_edit)
            for trie, pos in todo:
                if (trie, pos) not in found: # Have we processed this?
                    found.add((trie, pos))
                    if pos == len(string):
                        answers.extend(trie.strings) # ANSWERS FOUND HERE!!!
                        # We have to delete from the other string
                        for next_trie in trie.dict.values():
                            start_at_next_edit.add((next_trie, pos))
                    else:
                        # This string could have an insertion
                        start_at_next_edit.add((trie, pos+1))
                        for letter, next_trie in trie.dict.items():
                            # We could have had a deletion in this string
                            start_at_next_edit.add((next_trie, pos))
                            if letter == string[pos]:
                                todo.append((next_trie, pos+1)) # we matched farther
                            else:
                                # Could have been a substitution
                                start_at_next_edit.add((next_trie, pos+1))
            start_at_edit = start_at_next_edit
        return answers

# Sample usage
trie = Trie()
trie.add_string('foo')
trie.add_string('bar')
trie.add_string('baz')
print(trie.within_edits('ba', 1))
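
Not part of the answer above, but here is a rough sketch of how the Trie could be wired into the original problem. It reuses the private fuzzywuzzy helper fuzz._process_and_sort mentioned earlier for normalization (any lowercased, token-sorted normalization would do), and it re-checks every candidate pair with the real scorer, since the len(name) / 9 edit budget is only a necessary condition:

from fuzzywuzzy import fuzz

names = ['Pepsi Light', 'Light Pepsi', 'Pepsi Max', 'Coca Cola']

# Normalize once and remember which original names share each normalized form.
normalized = {}
for name in names:
    key = fuzz._process_and_sort(name, True)
    normalized.setdefault(key, []).append(name)

trie = Trie()
for key in normalized:
    trie.add_string(key)

pairs = []
for key, originals in normalized.items():
    # Names that normalize identically are trivially a match (ratio 100).
    for a in range(len(originals)):
        for b in range(a + 1, len(originals)):
            pairs.append((originals[a], originals[b], 100))
    # Edit budget for a ratio of at least 90 (necessary, not sufficient).
    max_edits = len(key) // 9
    for candidate in trie.within_edits(key, max_edits):
        if candidate <= key:  # handle each unordered pair of keys once
            continue
        for n1 in originals:
            for n2 in normalized[candidate]:
                score = fuzz.token_sort_ratio(n1, n2)
                if score >= 90:  # confirm with the real scorer
                    pairs.append((n1, n2, score))

print(pairs)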
btilly

As others pointed out, FuzzyWuzzy uses the Levenshtein distance, which is O(N^2). However, there are quite a few things in your code that can be optimised to improve the runtime a lot. This will not be as fast as btilly's trie implementation, but you keep a similar behaviour (e.g. sorting the words beforehand):

  1. Use RapidFuzz instead of FuzzyWuzzy (I am the author). It implements the same algorithms, but it is a lot faster.

  2. You are currently preprocessing the strings on each call to fuzz.token_sort_ratio; this can be done once beforehand.

  3. You can pass your score_cutoff to RapidFuzz, so it can exit early with a score of 0 when it knows the cutoff can not be reached.

The following implementation takes around 30 seconds on my machine, while your current implementation runs for about 7 minutes.

from rapidfuzz import fuzz, utils
import random
import string
from datetime import datetime
import pandas as pd

random.seed(18)

l = [''.join(random.choice(string.ascii_letters + string.digits + string.whitespace)
       for _ in range(random.randint(10, 20))
    )
    for s in range(10000)
]

start=datetime.now()
processed=[utils.default_process(name) for name in l]
res = []

for k in range(len(l)):
    for s in range(k+1,len(l)):
        probability = fuzz.token_sort_ratio(
            processed[k], processed[s], processor=None, score_cutoff=90)
        if  probability:
            res.append([l[k], l[s], probability])

df = pd.DataFrame(res, columns=['name1', 'name2', 'probab_same'])

print('Spent time: {}' .format(datetime.now() - start))           
print(df.head(5))
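
Not part of the original answer: newer RapidFuzz versions also ship process.cdist, which scores the whole matrix in one call in C++ and can spread the work over multiple cores (check that your installed version has it); a rough sketch with the 90 cutoff applied up front:

import numpy as np
from rapidfuzz import fuzz, process, utils

names = ['Pepsi Light', 'Light Pepsi', 'Pepsi Max', 'Coca Cola']
processed = [utils.default_process(n) for n in names]

# Score every pair at once; entries below the cutoff come back as 0.
scores = process.cdist(processed, processed,
                       scorer=fuzz.token_sort_ratio,
                       score_cutoff=90, workers=-1)

# Keep each unordered pair once (upper triangle, excluding the diagonal).
rows, cols = np.nonzero(np.triu(scores, k=1))
pairs = [(names[r], names[c], scores[r, c]) for r, c in zip(rows, cols)]
print(pairs)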
maxbachmann