Various search algorithms and performance for full text matching

Question

Let's say I have the following string:

/Volumes/01/LG_SICARIO_ES419SUB_16X9_240_2398_DIGITAL_FINAL.itt

How would I determine the fastest way to search this string for a match given input terms ['sicario', '419']

For example, the most basic version would be:

1) String contains:

s = '/Volumes/01/LG_SICARIO_ES419SUB_16X9_240_2398_DIGITAL_FINAL.itt'
terms = ['sicario', '419']
has_match = all([term.lower() in s.lower() for term in terms])

2) Regex

has_match = all([re.compile(term, re.I).search(s) for term in terms])

Other possible options would be:

Rabin-Karp
Boyer-Moore
And other string-searching algorithms

What would be the time complexity of the various algorithms?

Here is what I got for timing regex vs str() in:

 import timeit

 # Case insensitive
setup =  's="/Volumes/01/LG_SICARIO_ES419SUB_16X9_240_2398_DIGITAL_FINAL.itt"; terms = ["sicario", "419"]'
print min(timeit.Timer('all([term in s.lower() for term in terms])', setup=setup).repeat(7, 1000))
0.00134181976318

# Case sensitive
setup =  's="/Volumes/01/LG_SICARIO_ES419SUB_16X9_240_2398_DIGITAL_FINAL.itt"; terms = ["sicario", "419"]'
print min(timeit.Timer('all([term in s for term in terms])', setup=setup).repeat(7, 1000))
0.000231027603149


# Regex case insensitive
setup =  'import re; s="/Volumes/01/LG_SICARIO_ES419SUB_16X9_240_2398_DIGITAL_FINAL.itt"; compiled_terms = [re.compile("sicario", re.I), re.compile("419", re.I)]'
print min(timeit.Timer('all([compiled_term.search(s) for compiled_term in compiled_terms])', setup=setup).repeat(7, 1000))
0.00111889839172


# Regex case sensitive
setup =  'import re; s="/Volumes/01/LG_SICARIO_ES419SUB_16X9_240_2398_DIGITAL_FINAL.itt"; compiled_terms = [re.compile("sicario"), re.compile("419")]'
print min(timeit.Timer('all([compiled_term.search(s) for compiled_term in compiled_terms])', setup=setup).repeat(7, 1000))
0.000588893890381

It's pretty close, though case-sensitive string searches perform about 2x better than regex (at least on this input data).

kederrac · Accepted Answer · 2019-08-18T19:46:08.323

1

I think your basic version is the fastest (a combination of Boyer-Moore and Horspoo) available (sublinear search behaviour in good cases (O(n/m)), I will add small changes to your basic version:

def test():
    lower_s = s .lower()
    return all([term in lower_s for term in terms])

edited Aug 18 '19 at 19:46

answered Aug 18 '19 at 19:44

kederrac

16,819
6
32
55

@ruso -- could you please explain why it would be fastest though? Basically I'm trying to learn a bit more about the algorithms, etc. behind the scenes and why one approach would be better than another. – David542 Aug 18 '19 at 19:45
also you may have a look over the basic classification of search algorithms/Single-pattern algorithms https://en.wikipedia.org/wiki/String-searching_algorithm – kederrac Aug 18 '19 at 19:55
and I think is the fastest because it is implemented in c – kederrac Aug 18 '19 at 19:57

Various search algorithms and performance for full text matching

1 Answers1