
The CPython implementation of substring search (e.g. via in) uses the following algorithm. (The source leaves delta1 undefined; below it is filled in as the standard Boyer-Moore bad-character table, computed so the shift is never zero, and a bounds check is added so the lookahead s[i+m] stays in range.)

def delta1(p):
    # Boyer-Moore bad-character table, computed over all but the last
    # position so that the shift for p[m-1] is never zero
    m = len(p)
    d = {c: m for c in p}
    for j in range(m - 1):
        d[p[j]] = m - 1 - j
    return d

def find(s, p):
    # find the index of the first occurrence of p in s, or -1
    n = len(s)
    m = len(p)
    skip = delta1(p)[p[m-1]]
    i = 0
    while i <= n-m:
        if s[i+m-1] == p[m-1]: # last characters match (boyer-moore)
            # potential match: compare the rest of the window
            if s[i:i+m-1] == p[:m-1]:
                return i
            if i+m < n and s[i+m] not in p:
                i = i + m + 1 # (sunday)
            else:
                i = i + skip # (horspool)
        else:
            # mismatch on the last character
            if i+m < n and s[i+m] not in p:
                i = i + m + 1 # (sunday)
            else:
                i = i + 1
    return -1 # not found
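
A quick sanity check that this transcription behaves like str.find:

>>> find("hello world", "world")
6
>>> "hello world".find("world")
6
>>> find("hello world", "planet")
-1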

At least, that is according to this source (taken from this older answer), apparently written by the author of the CPython implementation.

The same source gives the worst-case complexity of this algorithm as O(nm), where n and m are the lengths of the two strings. I am interested in whether this bound is tight. My question is:

Are there adversarial examples for the algorithm behind Python's in? That is, can we give a sequence of pairs of strings (pattern, string) such that running pattern in string takes quadratic (or at least superlinear) time?

The standard example that demonstrates the quadratic worst-case running time of naive substring search, string = 'a'*n and pattern = 'a'*m + 'b', does not work, as the quick check below illustrates.
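
A minimal timing sketch of that claim, assuming only the standard timeit module (absolute numbers will vary by machine):

from timeit import timeit

# Classic naive-search killer: all-'a' haystack, pattern ending in 'b'.
# Here the last-character check fails immediately in every window, so
# the time should grow roughly linearly in n and barely depend on m.
for n, m in [(10_000, 10), (100_000, 10), (100_000, 100)]:
    t = timeit("pattern in string",
               setup=f"pattern = 'a'*{m} + 'b'; string = 'a'*{n}",
               number=100)
    print(f"n={n:>6} m={m:>3} t={t:.4f}s")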

Mees de Vries
  • What is the space complexity of algorithms you're willing to consider? There's a simple O(1)-time substring search algorithm which takes O(n^2) extra space, for instance (provided that you're not counting the amortized cost of building the data structure, obviously). – Patrick87 Jun 25 '19 at 12:30
  • @Patrick87, I am not interested in considering different string search algorithms. I am interested in the specific one implemented in CPython, described in the link(s) provided, and what its worst-case time complexity is. More specifically, I am interested in finding a family of strings on which this worst-case time complexity is 'achieved'. – Mees de Vries Jun 25 '19 at 12:52
  • 1
  • I did not vote on this; however, I think the question could be greatly improved by focusing on the question first and cutting down on text, e.g.: what is the worst-case time complexity for this algorithm, and what string and substring would incur it? – wihlke Jun 27 '19 at 11:52
  • 1
    @wihlke, thank you for your suggestions, I have updated the question to focus on the actual question and less on the details of why I am asking it. – Mees de Vries Jun 27 '19 at 13:58

2 Answers


The naive example of s='a'*n and p='a'*m+'b' does not work because of the line

if s[i+m-1] == p[m-1]:

This checks the last character of p ('b'), not the first, against the corresponding current position in s. Since that comparison always fails ('a' != 'b'), the result is just a single linear pass over s, which is why it is so fast.

If you flip p (s='a'*n and p='b'+'a'*m), a similar thing happens: this time the check above passes (the last character of p is now 'a'), but p is then compared forwards, so the 'b' is found immediately at the first position and each window costs only O(1). Again the example is linear and fast.

A simple change to the naive example that does show O(nm) behaviour is s='a'*n and p='a'*m+'ba'. In this case the last character of p is 'a', so the initial check passes, but the subsequent comparison then has to scan past all the leading 'a's of p before it hits the 'b'.

# full='a'*n; sub='a'*m+'b'
>>> from timeit import timeit
>>> timeit("sub in full", "sub='a'*10+'b'; full='a'*100")
0.13620498299860628
>>> timeit("sub in full", "sub='a'*10+'b'; full='a'*1000")
0.9594046580004942
>>> timeit("sub in full", "sub='a'*100+'b'; full='a'*1000")
0.9768632190007338
# Linear in n, but m has minimal effect: ~O(n)

# full='a'*n; sub='a'*m+'ba'
>>> timeit("sub in full", "sub='a'*10+'ba'; full='a'*100")
0.35251976200015633
>>> timeit("sub in full", "sub='a'*10+'ba'; full='a'*1000")
3.4642483099996753
>>> timeit("sub in full", "sub='a'*100+'ba'; full='a'*1000")
27.152958754999418
# Both n and m have linear effect: ~O(nm)
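
To see the same effect without timing noise, one can also instrument a character-by-character variant of the find() pseudocode from the question and count comparisons directly (a rough sketch, not the actual C implementation):

def count_comparisons(s, p):
    # character-by-character version of find() that returns the number
    # of character comparisons instead of the match position
    n, m = len(s), len(p)
    j = p.rfind(p[-1], 0, m - 1)        # delta1 shift for the last character
    skip = m - 1 - j if j >= 0 else m
    i = comps = 0
    while i <= n - m:
        comps += 1                      # last-character check
        if s[i+m-1] == p[m-1]:
            k = 0
            while k < m - 1 and s[i+k] == p[k]:
                k += 1
            comps += k + (k < m - 1)    # matches, plus the final mismatch
            if k == m - 1:
                return comps            # full match found
            i = i + m + 1 if (i + m < n and s[i+m] not in p) else i + skip
        else:
            i = i + m + 1 if (i + m < n and s[i+m] not in p) else i + 1
    return comps

n = 1000
for m in (10, 20, 40):
    print(m,
          count_comparisons('a'*n, 'a'*m + 'b'),   # flat in m: ~n
          count_comparisons('a'*n, 'b' + 'a'*m),   # flat in m: ~2n
          count_comparisons('a'*n, 'a'*m + 'ba'))  # grows with m: ~nm/2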
RPalmer

Try this:

import re
import time

def slow_match(n):
    pat = 'a' + ('z' * n)      # pattern: 'a' followed by n 'z's
    text = 'z' * (n + n)       # haystack of 2n 'z's; no match is possible
    start_time = time.time()
    if re.search(pat, text):
        print("Shouldn't happen")
    print("Searched", n, time.time() - start_time)

slow_match(10000)
slow_match(50000)
slow_match(100000)
slow_match(300000)
btilly
  • If I run this (including with larger `n`, successively doubling it up to 9600000) it appears to take linear time in the length of the input, not quadratic (or some other recognisable form of superlinear) time. Using `in` instead of `re.search` (which is what my question was about), the result is also given in what appears to be linear time, and much faster, too. – Mees de Vries Jun 26 '19 at 10:03