Faster way to print all starting indices of a substring in a string, including overlapping occurrences

Question

I'm trying to answer this homework question: Find all occurrences of a pattern in a string. Different occurrences of a substring can overlap with each other.

Sample 1.

Input:

TACG

GT

Output:

Explanation: The pattern is longer than the text and hence has no occurrences in the text.

Sample 2.

Input:

ATA

ATATA

Output:

0 2

Explanation: The pattern appears at positions 0 and 2 (and these two occurrences overlap each other).

Sample 3.

Input:

ATAT

GATATATGCATATACTT

Output:

1 3 9

Explanation: The pattern appears at positions 1, 3, and 9 in the text.

The answer I'm submitting is this one:

def all_indices(text, pattern):
    i = text.find(pattern)
    while i >= 0:
        print(i, end=' ')
        i = text.find(pattern, i + 1)


if __name__ == '__main__':
    pattern = input()  # the samples give the pattern on the first line
    text = input()     # and the text on the second
    all_indices(text, pattern)

However, this code is failing the final test cases:

Failed case #63/64: time limit exceeded (Time used: 7.98/4.00, memory used: 77647872/536870912.)

The online judge knows I'm sending the answer in Python, and has different time limits for different languages.

I have searched quite a bit for other answers and approaches: regexes, suffix trees, Aho-Corasick... but so far all of them underperform this simple solution (maybe because find is implemented in C?).

So my question is: are there ways to do this task faster?

@Barmar in which case? I just tried the second most voted answer on this question: http://stackoverflow.com/questions/4664850/find-all-occurrences-of-a-substring-in-python?noredirect=1&lq=1 , and the result is the same. — Marcus Vinícius Monteiro, Apr 14 '17 at 15:25
I'm not sure. The bug doesn't seem obvious to me. If you run it yourself, does it eventually finish? How long does it take? — Barmar, Apr 14 '17 at 15:29
@Barmar unfortunately, they don't provide us the test cases :( — Marcus Vinícius Monteiro, Apr 14 '17 at 15:30
Any algorithm that does this will take time proportional to the length of the text. So they can always make a text long enough to exceed some time limit. — Barmar, Apr 14 '17 at 15:32
Have they told you the maximum possible lengths of the strings? — Barmar, Apr 14 '17 at 15:33
The worst case would be something like `pattern = 'A'`, `text = 'A' * 10**7` — Barmar, Apr 14 '17 at 15:37
@Barmar thank you. It terminates! For the case you gave, just took a lot longer than 4 seconds to print all the string indices :D. Still wondering if there is a faster way... — Marcus Vinícius Monteiro, Apr 14 '17 at 15:53
One optimization is to stop when `i > len(text) - len(pattern)` — Barmar, Apr 14 '17 at 16:01
@Barmar I'm now thinking that the `print` calls are slowing down my answer... — Marcus Vinícius Monteiro, Apr 14 '17 at 16:01
I assume you're required to print the indexes, so there's nothing you can do to avoid that cost. — Barmar, Apr 14 '17 at 16:02
@Barmar true. The time is still the same with the optimization you mentioned, probably because the test case is the one you gave me before. I'll try implementing different algorithms to see if the test cases are more lenient towards a specific one. — Marcus Vinícius Monteiro, Apr 14 '17 at 16:12
That optimization will help with a case like `text = really long string`, `pattern = almost as long string that only appears once` — Barmar, Apr 14 '17 at 16:22
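
The worst case discussed in these comments can be reproduced locally to separate the cost of searching from the cost of printing. The following is only a rough sketch: the helper names `find_all` and `time_worst_case` and the text length are illustrative assumptions, not part of the original thread, and the timings will vary by machine.

```python
import time


def find_all(text, pattern):
    """Collect every (possibly overlapping) match index using str.find."""
    indices = []
    i = text.find(pattern)
    while i >= 0:
        indices.append(i)
        i = text.find(pattern, i + 1)
    return indices


def time_worst_case(n):
    """Time the search alone on a text made of n identical characters."""
    text = 'A' * n  # every position is a match for the pattern 'A'
    start = time.perf_counter()
    indices = find_all(text, 'A')
    elapsed = time.perf_counter() - start
    return len(indices), elapsed


if __name__ == '__main__':
    # Hypothetical size; the judge's real limits are not known.
    count, seconds = time_worst_case(10**6)
    print(f'{count} matches found in {seconds:.3f} s (search only, no printing)')
```

If the search alone finishes quickly at this size, the remaining time in the original program is likely spent producing the output.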

score 1 · Answer 1 · answered Apr 14 '17 at 16:14

If print is what slows your program the most, you should try to call it as little as possible. A quick and dirty solution to your problem:

def all_indices(string, pattern):
    result = []
    idx = string.find(pattern)
    while idx >= 0:
        result.append(str(idx))
        idx = string.find(pattern, idx + 1)
    return result

if __name__ == '__main__':
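    pattern = input()  # the samples give the pattern on the first line
    string = input()
    # Build the whole output once and print it with a single call.
    print(' '.join(all_indices(string, pattern)))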

In the future, to identify which part of your code is slowing down the whole process, you can use the Python profilers, such as cProfile from the standard library.
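
As a rough illustration of that profiling step, the sketch below runs the printing version from the question under `cProfile` and reports where the calls go. The helper `profile_report` and the input size are assumptions made for this example. Output is captured in memory here, so the `print` cost shown will be lower than when writing to a real terminal or pipe.

```python
import cProfile
import io
import pstats
from contextlib import redirect_stdout


def all_indices_printing(text, pattern):
    """Print each match as soon as it is found (the approach in the question)."""
    i = text.find(pattern)
    while i >= 0:
        print(i, end=' ')
        i = text.find(pattern, i + 1)


def profile_report(func, *args, top=5):
    """Run func(*args) under cProfile and return the top entries as text."""
    profiler = cProfile.Profile()
    sink = io.StringIO()
    # Capture the function's own output so it does not flood the terminal.
    with redirect_stdout(sink):
        profiler.enable()
        func(*args)
        profiler.disable()
    report = io.StringIO()
    stats = pstats.Stats(profiler, stream=report)
    stats.sort_stats('cumulative').print_stats(top)
    return report.getvalue()


if __name__ == '__main__':
    # Hypothetical worst-case-style input: every position is a match.
    print(profile_report(all_indices_printing, 'A' * 100_000, 'A'))
```

The report lists the number of calls and the time spent in `print` and `str.find` separately, which makes it easier to see which of the two dominates.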

score 0 · Answer 2 · answered Apr 14 '17 at 16:37

I believe that the test cases were being more lenient towards the Knuth-Morris-Pratt algorithm. This code, copied from https://en.wikibooks.org/wiki/Algorithm_Implementation/String_searching/Knuth-Morris-Pratt_pattern_matcher#Python, passed all the cases:

# Knuth-Morris-Pratt string matching
# David Eppstein, UC Irvine, 1 Mar 2002

#from http://code.activestate.com/recipes/117214/
def KnuthMorrisPratt(text, pattern):

    '''Yields all starting positions of copies of the pattern in the text.
    Calling conventions are similar to string.find, but its arguments can be
    lists or iterators, not just strings, it returns all matches, not just
    the first one, and it does not need the whole text in memory at once.
    Whenever it yields, it will have read the text exactly up to and including
    the match that caused the yield.'''

    # allow indexing into pattern and protect against change during yield
    pattern = list(pattern)

    # build table of shift amounts
    shifts = [1] * (len(pattern) + 1)
    shift = 1
    for pos in range(len(pattern)):
        while shift <= pos and pattern[pos] != pattern[pos-shift]:
            shift += shifts[pos-shift]
        shifts[pos+1] = shift

    # do the actual search
    startPos = 0
    matchLen = 0
    for c in text:
        while matchLen == len(pattern) or \
              matchLen >= 0 and pattern[matchLen] != c:
            startPos += shifts[matchLen]
            matchLen -= shifts[matchLen]
        matchLen += 1
        if matchLen == len(pattern):
            yield startPos
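
For comparison, here is a minimal, self-contained sketch of the same idea using the common prefix-function formulation of Knuth-Morris-Pratt instead of the shift table in the recipe above. The names `prefix_function`, `kmp_indices`, and `format_answer` are chosen for this example only, and returning no matches for an empty pattern is an assumption.

```python
def prefix_function(s):
    """pi[i] is the length of the longest proper prefix of s[:i + 1]
    that is also a suffix of it."""
    pi = [0] * len(s)
    k = 0
    for i in range(1, len(s)):
        while k > 0 and s[i] != s[k]:
            k = pi[k - 1]
        if s[i] == s[k]:
            k += 1
        pi[i] = k
    return pi


def kmp_indices(text, pattern):
    """Return all 0-based start positions of pattern in text, overlaps included."""
    # Assumption: an empty pattern yields no matches.
    if not pattern or len(pattern) > len(text):
        return []
    pi = prefix_function(pattern)
    result = []
    k = 0  # number of pattern characters currently matched
    for i, c in enumerate(text):
        while k > 0 and c != pattern[k]:
            k = pi[k - 1]
        if c == pattern[k]:
            k += 1
        if k == len(pattern):
            result.append(i - k + 1)
            k = pi[k - 1]  # fall back so overlapping matches are still found
    return result


def format_answer(pattern, text):
    """Build the whole output line at once so it can be printed in one call."""
    return ' '.join(map(str, kmp_indices(text, pattern)))


if __name__ == '__main__':
    print(format_answer('ATA', 'ATATA'))               # sample 2: 0 2
    print(format_answer('ATAT', 'GATATATGCATATACTT'))  # sample 3: 1 3 9
```

For the judge, the two lines would be read with `input()` (pattern first, then text, as in the samples) and the result of `format_answer` printed once.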
