I am looking for an efficient implementation of a string similarity metric function in Python (or a lib that provides Python bindings).
I want to compare strings averaging 10kb in size, and I can't take any shortcuts like comparing line-by-line; I need to compare the entire thing. I don't really care what exact metric is used, as long as the results are reasonable and computation is fast. Here's what I've tried so far:
- `difflib.SequenceMatcher` from the standard lib: `.ratio()` gives good results, but takes >100ms for 10kb texts. `.quick_ratio()` takes only half the time, but the results are sometimes far off the real value.
- `python-Levenshtein`: Levenshtein distance is an acceptable metric for my use case, but `Levenshtein.ratio('foo', 'bar')` is not faster than `SequenceMatcher`.
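For reference, this is roughly how I measured the `difflib` timings — a minimal stdlib-only sketch using synthetic ~10kb strings (not my real data; the `python-Levenshtein` timing is analogous but needs the third-party package installed):

```python
import difflib
import random
import string
import time

# Build a synthetic ~10kb string and a slightly mutated copy of it,
# so the two inputs are similar but not identical.
random.seed(0)
a = "".join(random.choice(string.ascii_lowercase + " \n") for _ in range(10_000))
b_chars = list(a)
for i in random.sample(range(len(b_chars)), 500):
    b_chars[i] = random.choice(string.ascii_lowercase)
b = "".join(b_chars)

sm = difflib.SequenceMatcher(None, a, b)

start = time.perf_counter()
full = sm.ratio()          # accurate similarity in [0, 1], slow on ~10kb inputs
t_ratio = time.perf_counter() - start

start = time.perf_counter()
quick = sm.quick_ratio()   # documented upper bound on ratio(), much faster
t_quick = time.perf_counter() - start

print(f"ratio={full:.3f} in {t_ratio * 1000:.1f}ms, "
      f"quick_ratio={quick:.3f} in {t_quick * 1000:.1f}ms")
```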
Before I start benchmarking every lib on PyPI that provides functions for measuring string similarity, maybe you can point me in the right direction? I'd love to reduce the time for a single comparison to less than 10ms (on commodity hardware), if possible.