4

I am looking for an efficient implementation of a string similarity metric function in Python (or a lib that provides Python bindings).

I want to compare strings that average about 10 kB in size, and I can't take any shortcuts like comparing line by line; I need to compare the entire thing. I don't really care what exact metric is used, as long as the results are reasonable and the computation is fast. Here's what I've tried so far:

  • difflib.SequenceMatcher from the standard lib. ratio() gives good results, but takes > 100 ms for a 10 kB text. quick_ratio() takes only half the time, but the results are sometimes far off the real value.
  • python-Levenshtein: Levenshtein distance is an acceptable metric for my use case, but Levenshtein.ratio('foo', 'bar') is not faster than SequenceMatcher. A rough sketch of how I call both is below.
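For reference, this is roughly how I'm calling both; the two strings here are just stand-ins for my ~10 kB documents:

```python
import difflib
import Levenshtein  # from the python-Levenshtein package

# Stand-ins; in practice a and b are ~10 kB documents
a = "the quick brown fox jumps over the lazy dog. " * 200
b = "the quick brown fox jumped over a lazy dog! " * 200

# Standard library: ratio() gives good results but takes > 100 ms at this size;
# quick_ratio() is about twice as fast, but can be far off the real value.
sm = difflib.SequenceMatcher(None, a, b)
print(sm.ratio())
print(sm.quick_ratio())

# python-Levenshtein: an acceptable metric, but not faster than ratio() above.
print(Levenshtein.ratio(a, b))
```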

Before I start benchmarking every library on PyPI that provides functions for measuring string similarity, maybe you can point me in the right direction? I'd love to reduce the time for a single comparison to less than 10 ms (on commodity hardware), if possible.

klamann
  • I think the ~quadratic complexity makes this quite hard. [Here](https://stackoverflow.com/questions/4057513/levenshtein-distance-algorithm-better-than-onm) are some mentions of alternatives (something like a fixed-parameter algorithm + an approximation algorithm), but your question feels a bit broad to evaluate those (*I don't really care what exact metric* + unknown data). – sascha May 29 '18 at 12:16
  • If you're interested in the specifics: I'm working on a system for near-duplicate detection, which uses MinHashes for identifying possible duplicates. Sometimes I get a ton of possible duplicates and I need to find the best match. That's why I don't care what *exact* metric I use, as long as it is a metric in the mathematical sense. Also, I don't care about the data; it should apply to any kind of string. And with inputs that are reasonably small, even algorithms with quadratic complexity should perform better than those that I have reviewed so far (at least I hope so...) – klamann May 29 '18 at 12:44
  • I can't offer more (it's not my area of expertise) and you will need to decide if those alternatives are worth an attempt (coding needed). But just taking your numbers and quadratic complexity, with a pretty dumb calculation (assuming byte = char) like ```10 kB = 10240 B -> 10240^2 = 104,857,600```, 100 ms looks quite fast (to me). – sascha May 29 '18 at 12:47
  • Well, if I had to compare every character with every other one, then yes, this would be pretty slow. But I think there's room for optimization here. Also, I don't need the best possible solution; I'm totally fine with a reasonable approximation, maybe with up to 3% error. – klamann May 29 '18 at 12:56

3 Answers

6

edlib seems to be fast enough for my use case.

It's a C++ lib with Python bindings that calculates the Levenshtein distance of texts under 100 kB in less than 10 ms each (on my machine). 10 kB texts are done in ~1 ms, which is 100x faster than difflib.SequenceMatcher.
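A minimal sketch of how I use it; the normalization of the edit distance into a 0–1 similarity is my own convention, not something edlib provides:

```python
import edlib

def similarity(a: str, b: str) -> float:
    """Levenshtein-based similarity in [0, 1], built on edlib's edit distance."""
    if not a and not b:
        return 1.0
    result = edlib.align(a, b)          # global alignment (mode "NW") by default
    distance = result["editDistance"]   # plain Levenshtein distance
    return 1.0 - distance / max(len(a), len(b))

print(similarity("kitten", "sitting"))  # ~0.57
```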

klamann
1

I've had some luck with RapidFuzz. I don't know how it compares to the others, but it was much faster than thefuzz/fuzzywuzzy.

I don't know if it's applicable to your use case, but it's one of the first things you find when you google "fast string similarity python".
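A quick example of how I call it; fuzz.ratio takes the same arguments as fuzzywuzzy's version and returns a score between 0 and 100:

```python
from rapidfuzz import fuzz

a = "the quick brown fox jumps over the lazy dog"
b = "the quick brown fox jumped over a lazy dog!"

# Normalized similarity scaled to 0-100; same call signature as fuzzywuzzy's fuzz.ratio
print(fuzz.ratio(a, b))
```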

Danferno
0

Based on a lot of reading I've been able to do, something like tfidf_matcher worked well for me. It returns the best k matches, and it's easily 1000x faster than FuzzyWuzzy.
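As far as I can tell, the underlying idea is TF-IDF over character n-grams plus a cosine nearest-neighbour lookup. Here's a rough sketch of that general approach using scikit-learn directly; this is not tfidf_matcher's own API, just the same idea spelled out:

```python
# TF-IDF on character n-grams + cosine nearest neighbours (sketch, not tfidf_matcher's API)
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

candidates = ["first candidate document ...", "second candidate document ..."]
query = "first candidate document with a small edit ..."
k = 2  # number of best matches to return

# Character n-grams make the vectors robust to small edits anywhere in the text
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 3))
matrix = vectorizer.fit_transform(candidates)

nn = NearestNeighbors(n_neighbors=min(k, len(candidates)), metric="cosine").fit(matrix)
distances, indices = nn.kneighbors(vectorizer.transform([query]))
print(indices[0])        # indices of the k best matches
print(1 - distances[0])  # their cosine similarities
```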

  • 1
    As it’s currently written, your answer is unclear. Please [edit] to add additional details that will help others understand how this addresses the question asked. You can find more information on how to write good answers [in the help center](/help/how-to-answer). – Community Jul 12 '22 at 14:13