The CPython implementation of substring search (e.g. via `in`) uses the following algorithm:
```python
def find(s, p):
    # find first occurrence of p in s
    n = len(s)
    m = len(p)
    skip = delta1(p)[p[m-1]]
    i = 0
    while i <= n-m:
        if s[i+m-1] == p[m-1]: # (boyer-moore)
            # potential match
            if s[i:i+m-1] == p[:m-1]:
                return i
            if s[i+m] not in p:
                i = i + m + 1 # (sunday)
            else:
                i = i + skip # (horspool)
        else:
            # skip
            if s[i+m] not in p:
                i = i + m + 1 # (sunday)
            else:
                i = i + 1
    return -1 # not found
```
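As quoted, this is not quite runnable: `delta1` is never defined, and `s[i+m]` reads one character past the final window when `i == n-m`. Here is a minimal runnable sketch, assuming `delta1` is the usual Boyer-Moore-Horspool bad-character table and adding a bounds check (both of these are my assumptions, not taken from the source):

```python
def delta1(p):
    # Horspool bad-character table: for each character c of the pattern,
    # how far the window may safely shift when a match attempt fails and
    # the last window character is c. Characters that occur only at the
    # very end of p (or not at all) allow a full shift of len(p).
    m = len(p)
    table = {c: m for c in p}
    for j, c in enumerate(p[:m-1]):
        table[c] = m - 1 - j   # distance from last earlier occurrence to the end
    return table

def find(s, p):
    # find first occurrence of p in s; -1 if absent
    n, m = len(s), len(p)
    skip = delta1(p)[p[m-1]]
    i = 0
    while i <= n - m:
        if s[i+m-1] == p[m-1]:              # (boyer-moore) compare last char first
            if s[i:i+m-1] == p[:m-1]:       # potential match: compare the rest
                return i
            if i + m < n and s[i+m] not in p:
                i = i + m + 1               # (sunday) next char rules out all overlaps
            else:
                i = i + skip                # (horspool)
        else:
            if i + m < n and s[i+m] not in p:
                i = i + m + 1               # (sunday)
            else:
                i = i + 1
    return -1 # not found
```

Read this way, `skip` is the Horspool shift for the pattern's last character: the distance from its last earlier occurrence in `p` to the end, or the full `len(p)` if it occurs nowhere else.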
At least, according to this source (taken from this older answer), written by the author (?) of the CPython implementation.
This same source states the worst-case complexity of this algorithm as O(nm), where n and m are the lengths of the two strings. I am interested in whether this bound is tight. My question is:
Are there adversarial examples for the algorithm used in Python's `in`? Can we give a sequence of pairs of strings `(pattern, string)` so that running `pattern in string` takes quadratic (or at least superlinear) time?
The standard example that demonstrates the quadratic worst-case run-time of naive substring search, where `string = 'a'*n` and `pattern = 'a'*m + 'b'`, does not work.
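To see why not: the pattern ends in `'b'` while the text contains only `'a'`, so the initial last-character comparison (the boyer-moore step) fails at every alignment and each window costs only O(1) work. A quick timing sketch (sizes chosen arbitrarily for illustration) makes this visible:

```python
import timeit

# Naive-search killer: all-'a' text, pattern of 'a's ending in 'b'.
# Against the algorithm above the last-character comparison
# ('a' vs 'b') fails at every alignment, so each window is O(1).
for n in (10**5, 10**6, 10**7):
    string = 'a' * n
    pattern = 'a' * (n // 10) + 'b'
    t = timeit.timeit(lambda: pattern in string, number=5)
    print(f"n={n:>10,}  time per search: {t/5:.4f}s")
```

On CPython the per-search time should grow roughly tenfold per step, tracking n; a quadratic O(nm) blow-up would grow roughly a hundredfold, since m scales with n here.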