42

I am wondering how the in operator is implemented. For instance:

>>> s1 = 'abcdef'
>>> s2 = 'bcd'
>>> s2 in s1
True

In CPython, which algorithm is used to implement the string match, and what is the time complexity? Is there any official document or wiki about this?

arshajii
mitchelllc

1 Answer

57

It's a combination of Boyer-Moore and Horspool.

You can view the C code here:

Fast search/count implementation, based on a mix between Boyer-Moore and Horspool, with a few more bells and whistles on the top. For some more background, see: https://web.archive.org/web/20201107074620/http://effbot.org/zone/stringlib.htm.

From the link above:

When designing the new algorithm, I used the following constraints:

  • should be faster than the current brute-force algorithm for all test cases (based on real-life code), including Jim Hugunin’s worst-case test
  • small setup overhead; no dynamic allocation in the fast path (O(m) for speed, O(1) for storage)
  • sublinear search behaviour in good cases (O(n/m))
  • no worse than the current algorithm in worst case (O(nm))
  • should work well for both 8-bit strings and 16-bit or 32-bit Unicode strings (no O(σ) dependencies)
  • many real-life searches should be good, very few should be worst case
  • reasonably simple implementation
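To make the O(n/m) and O(nm) figures above more concrete, here is a minimal sketch of a plain Horspool search in Python. It is not the CPython code: the function name `horspool_find` is made up for illustration, and it uses a dictionary as the skip table, which is a simplification that the constraints quoted above (O(1) storage, no O(σ) dependencies) would rule out in the real implementation.

```python
def horspool_find(text: str, pattern: str) -> int:
    """Return the lowest index of pattern in text, or -1 (same contract as str.find).

    Illustrative Horspool search only; CPython's stringlib differs in detail.
    """
    n, m = len(text), len(pattern)
    if m == 0:
        return 0
    if m > n:
        return -1

    # Skip table: for every character in pattern[:-1], the distance from its
    # last occurrence to the end of the pattern. Later occurrences overwrite
    # earlier ones, so the smallest (safe) shift is kept.
    shift = {ch: m - 1 - i for i, ch in enumerate(pattern[:-1])}

    i = 0  # start of the current window in text
    while i <= n - m:
        # Compare the window against the pattern from right to left.
        j = m - 1
        while j >= 0 and text[i + j] == pattern[j]:
            j -= 1
        if j < 0:
            return i  # every character matched
        # Slide by the shift for the character under the window's last
        # position; characters that never occur in pattern[:-1] allow a
        # full jump of m, which is where the sublinear behaviour comes from.
        i += shift.get(text[i + m - 1], m)
    return -1


print(horspool_find("abcdef", "bcd"))
```

When the character at the end of the window rarely occurs in the pattern, most windows are rejected after one comparison and skipped by m positions, giving roughly n/m steps. When every window almost matches, each one costs up to m comparisons and the shift is only 1, giving the O(nm) worst case.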
arshajii
  • Thanks for the quick reply! Based on this article, http://effbot.org/zone/stringlib.htm, the time complexity is sublinear, which is better than the KMP algorithm. – mitchelllc Aug 09 '13 at 03:49
  • 1
    @mitchelllc In the *best cases* it can be sublinear. – arshajii Aug 09 '13 at 03:49
  • 1
    @arshajii Yes, that's what I want. Thanks again! – mitchelllc Aug 09 '13 at 03:52
  • 1
    @arshajii One more question: do you know when the good cases happen? I cannot figure that out. Thanks. – mitchelllc Aug 09 '13 at 14:56
  • 3
    @mitchelllc Probably when there aren't frequent 'false partial matches', e.g. searching for 'fg' in 'bcdefg' rather than looking for 'aab' in 'aaaacaaab'. – obataku Jul 18 '16 at 02:07
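A rough way to see the difference between good and bad cases is to count character comparisons. The sketch below assumes a plain Horspool-style skip rule rather than the actual CPython code, and `count_comparisons` is a hypothetical helper written for this illustration. The inputs are exaggerated versions of the two situations described in the comment above.

```python
def count_comparisons(text: str, pattern: str) -> int:
    """Count character comparisons made by a simple Horspool-style scan.

    Hypothetical helper for illustration; it stops at the first match.
    """
    n, m = len(text), len(pattern)
    # Shift distances for every character of the pattern except the last.
    shift = {ch: m - 1 - i for i, ch in enumerate(pattern[:-1])}
    comparisons = 0
    i = 0
    while i <= n - m:
        j = m - 1
        while j >= 0:
            comparisons += 1
            if text[i + j] != pattern[j]:
                break
            j -= 1
        if j < 0:
            break  # full match found, stop scanning
        i += shift.get(text[i + m - 1], m)
    return comparisons


n, m = 10_000, 10

# Good case: the text never contains a pattern character, so every window
# fails on its first comparison and the scan jumps ahead by m each time.
good = count_comparisons("x" * n, "abcdefghij")

# Bad case: every window matches on all but its first character (a "false
# partial match"), and the scan can only advance by one position.
bad = count_comparisons("a" * n, "b" + "a" * (m - 1))

print(good, bad)
```

The first count comes out at about n/m and the second at about n*m, which matches the best-case and worst-case bounds quoted in the answer.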