
I'm testing the output of a simulation to see if it enters a loop at some point, so I need to know if the output repeats itself. For example, there may be 400 digits, followed by a 400000 digit cycle. The output consists only of digits from 0-9. I have the following regex function that I'm using to match repetitions in a single long string:

import re

def repetitions(s):
    r = re.compile(r"(.+?)\1+")
    for match in r.finditer(s):
        if len(match.group(1)) > 1 and len(match.group(0)) / len(match.group(1)) > 4:
            yield (match.group(1), len(match.group(0)) / len(match.group(1)))

This function works fantastically, but it takes far too long. My most recent test was 4 million digits, and it took 4.5 hours to search. It found no repetitions, so I now need to increase the search space. The code only concerns itself with subsequences that repeat themselves more than 4 times because I'm considering 5 repetitions to give a set that can be checked manually: the simulation will generate subsequences that will repeat hundreds of times. I'm running on a four core machine, and the digits to be checked are generated in real time. How can I increase the speed of the search?
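As a quick sanity check of what the function above yields (the `"999" + "12" * 7` test string here is just an illustrative example, not from the question): the leading `999` is matched by the regex but filtered out by the `len(match.group(1)) > 1` condition, while the seven copies of `12` pass both thresholds.

```python
import re

def repetitions(s):
    r = re.compile(r"(.+?)\1+")
    for match in r.finditer(s):
        # keep units longer than 1 digit that repeat more than 4 times
        if len(match.group(1)) > 1 and len(match.group(0)) / len(match.group(1)) > 4:
            yield (match.group(1), len(match.group(0)) / len(match.group(1)))

print(list(repetitions("999" + "12" * 7)))  # [('12', 7.0)]
```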

nhahtdh
interplex
  • That question deals with strings that are entirely repetitions, something like "123451234512345". I need to know if the string eventually repeats itself, something like "00706857383123451234512345". The solutions in that question don't catch this case. – interplex Jul 06 '15 at 20:39
  • Is there any sort of pattern around the repetitive portions? Is it always between underscores, or always repeating until the end of the string, that sort of thing? – TigerhawkT3 Jul 06 '15 at 20:45
  • It always repeats to the end of the string, but perhaps not fully. Something like "8576463812345123451234512345123" is fully possible. – interplex Jul 06 '15 at 20:50
  • @interplex In that case, certainly, starting at the end of the string and going backwards (i.e. `r.finditer(reverse(s))`) would help. – Lucas Ross Jul 06 '15 at 20:52
  • I'd recommend adapting one of the answers in the linked question rather than attempting to parallelize this regex, because 4.5 hours distributed among four cores is still nowhere close to real time. – TigerhawkT3 Jul 06 '15 at 20:55
  • What's the shortest that the repetitive portion of the string could be? Like in your comment example above, you have "00706857383123451234512345", which repeats "12345", which is 5 digits long. Is 5 the shortest that repetitive portion could be? Certainly it must be greater than one? – justinpawela Jul 06 '15 at 21:04
  • @user2194039 The shortest the repetitive section can be is 2 digits. Other than that, there's no definite minimum length. – interplex Jul 06 '15 at 21:18
  • @TigerhawkT3 the search isn't done in real time, the string is generated in real time. I'm looking for ways to speed up the search. – interplex Jul 06 '15 at 21:19
  • Once you decide on the best algorithm to use, running the code with PyPy might help to speed it up. – rth Jul 06 '15 at 22:21
  • There are a number of algorithmic directions one can go in, namely parallelizing an NFA or using a custom dynamic programming solution. If this is mumbo jumbo to you, that's okay, just ignore it; I'm trying to motivate the following questions. Your problem statement is a bit vague in a number of ways. First, the code you give concerns itself with only repetitions greater than 4, and that's kind of important. Second, can we limit this to just digits in 0-9? Or single chars? Third: you want *all* repetitions, so just listing the results can be exponential. For example, consider 32 1's, – rocky Jul 06 '15 at 22:23
  • You might consider switching to C++ and using its regex package. it is likely to be a lot faster than Python. – Ira Baxter Jul 06 '15 at 22:54
  • I tried the OP's regex solution on small strings like `'abc1231231'` and `'123123123123'`, and it always seems to return an empty generator. – TigerhawkT3 Jul 06 '15 at 22:55
  • @rocky The sequences I'm dealing with end up being repeated hundreds of times once they start repeating. Greater than 4 repetitions was just a method to weed out shorter sequences. Yes, the patterns are composed of just digits 0-9. – interplex Jul 07 '15 at 03:21
  • @interplex there are two other features in the code that are not explicitly mentioned in the problem statement. The first is that within a repetition you seem to want just the longest match of a given sequence. The second seems to be that you want *all* matches, and here there is still some ambiguity. Suppose the input is '1234123452345'. There are two repetitions of '1234' and two of '2345', and they overlap. When I run your function on it, it finds only one result, '1234', but misses the overlapping '2345'. If these patterns don't overlap, we get two matches. Please clarify. – rocky Jul 07 '15 at 04:06
  • @IraBaxter: It's still backtracking engine (and must use backtracking engine even if implemented as hybrid). Not going to be any faster on long strings. – nhahtdh Jul 07 '15 at 04:15

3 Answers


Based on information given by nhahtdh in one of the other answers, some things have come to light.

First, the problem you are posing is called finding "tandem repeats" or "squares".

Second, the algorithm given in http://csiflabs.cs.ucdavis.edu/~gusfield/lineartime.pdf finds z tandem repeats in O(n log n + z) time and is "optimal" in the sense that there can be that many answers. You may be able to parallelize the tandem searches, but I'd first do timings with the simple-minded approach and divide by 4 to see if that is in the speed range you expect.

Also, in order to use this approach you are going to need O(n) space to store the suffix tree. So if you have on the order of 400,000 digits, you are going to need on the order of 400,000 steps to build it and on the order of 400,000 bytes to store it.

I am not totally sure what is meant by searching in "real time"; I usually think of it as a hard limit on how long an operation can take. If that's the case, then that's not going to happen here: this algorithm needs to read in the entire input string and process it before you start to get results. In that sense, it is what's called an "off-line" algorithm.

http://web.cs.ucdavis.edu/~gusfield/strmat.html has C code that you can download. (In tar file strmat.tar.gz look for repeats_tandem.c and repeats_tandem.h).

In light of the above, if that algorithm isn't sufficiently fast or space efficient, I'd look for ways to change or narrow the problem. Maybe you only need a fixed number of answers (e.g. up to 5)? If the cycles are a result of executing statements in a program, then given that programming languages (other than assembler) don't have arbitrary "goto" statements, it's possible that this narrows the kinds of cycles that can occur, and making use of that structure might offer a way to speed things up.
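One way to narrow the problem along these lines: the OP states in the comments that the repetition always runs to the end of the string. Under that assumption, a suffix tree isn't needed at all; for each candidate period you can scan backwards from the end and count how far the periodicity extends. This is a sketch only (not from the paper), and the names `tail_cycle`, `min_reps`, and `min_period` are made up for illustration:

```python
def tail_cycle(s, min_reps=5, min_period=2):
    """Return (unit, reps) if a suffix of s consists of at least min_reps
    copies of some unit of length >= min_period; shortest period wins."""
    n = len(s)
    for p in range(min_period, n // min_reps + 1):
        # extend the periodicity s[i] == s[i + p] leftward from the end
        i = n - p - 1
        while i >= 0 and s[i] == s[i + p]:
            i -= 1
        tail_len = n - 1 - i        # length of the longest suffix with period p
        reps = tail_len // p        # complete copies of the unit in that suffix
        if reps >= min_reps:
            return s[n - p:], reps
    return None

print(tail_cycle("8576" + "12345" * 5))  # ('12345', 5)
```

Each candidate period costs at most one pass over the string, so checking periods up to P is O(n·P); that is worse than O(n log n + z) asymptotically but needs no tree, and the backward scan stops almost immediately for wrong periods.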

rocky
  • `As nhahtdh reports you can run tandem searches so that will speed this up by at most 4 times,` I didn't say anything about this, so if it's your interpretation of the paper, say so. – nhahtdh Jul 07 '15 at 06:22
  • @nhahtdh Ok, my mistake. I've edited this. What can be done in parallel is Phase II of the paper, where there is a worklist, and possibly algorithm 4. My main point here was just the simple one that using 4 cores improves things by at most a factor of 4. – rocky Jul 07 '15 at 06:41
  • This seems to be the best option given my current question statement. I'm not quite sure how to run the C code, but I'll take a crack at it. Thanks! – interplex Jul 08 '15 at 04:44

When one algorithm is too slow, switch algorithms.

If you are looking for repeating strings, you might consider using a suffix tree scheme: https://en.wikipedia.org/wiki/Suffix_tree

This will find common substrings for you in linear time. EDIT: @nhahtdh in a comment below has referenced a paper that tells you how to pick out all z tandem repeats very quickly. If somebody upvotes my answer, @nhahtdh should logically get some of the credit.

I haven't tried it, but I'd guess that you might be able to parallelize the construction of the suffix tree itself.
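To make the suffix idea concrete, here is a deliberately naive sketch (not the linear-time construction the answer refers to): sort all suffixes, and the longest repeated substring is the longest common prefix of some adjacent pair in sorted order. This is O(n² log n) time and O(n²) space, so it only illustrates the principle on small inputs:

```python
def longest_repeated_substring(s):
    # naive suffix-array idea: after sorting, any repeated substring is a
    # common prefix of two suffixes that end up adjacent in sorted order
    suffixes = sorted(s[i:] for i in range(len(s)))
    best = ""
    for a, b in zip(suffixes, suffixes[1:]):
        k = 0
        while k < min(len(a), len(b)) and a[k] == b[k]:
            k += 1
        if k > len(best):
            best = a[:k]
    return best

print(longest_repeated_substring("abc1231231"))  # '1231'
```

A real suffix tree (or suffix array with an LCP array) gets the same answer in O(n) or O(n log n), which is what makes the approach viable for millions of digits.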

Ira Baxter
  • I think you should edit your post to mention the fact that suffix tree can be used to find all z tandem repeats in O(n log n + z). This seems to be the paper: http://csiflabs.cs.ucdavis.edu/~gusfield/lineartime.pdf – nhahtdh Jul 07 '15 at 04:41

I'm sure there's room for optimization, but test this algorithm on shorter strings to see how it compares to your current solution:

def partial_repeat(string):
    l = len(string)
    for i in range(2, l//2+1):  # candidate lengths for the repeating unit
        s = string[0:i]         # candidate unit: the first i characters
        multi = l//i-1          # one less than the whole copies that could fit
        factor = l//(i-1)
        ls = len(s)
        # the string must start with multi copies of the unit, any leftover
        # tail must be shorter than one unit, and the unit must actually repeat
        if s*multi == string[:ls*multi] and len(string)-len(string[:ls*factor]) <= ls and s*2 in string:
            return s


>>> test_string
'abc1231231231231'
>>> results = {x for x in (partial_repeat(test_string[i:]) for i in range(len(test_string))) if x}
>>> sorted(sorted(results, key=test_string.index), key=test_string.count, reverse=True)[0]
'123'

In this test string, it's unclear whether the non-repeating initial characters are 'abc' or 'abc1', so the repeating string could be either '123' or '231'. The above sorts each found substring by its earliest appearance in the test string, sorts again (sorted() is a stable sort) by the highest frequency, and takes the top result.

With standard loops and min() instead of comprehensions and sorted():

>>> g = {partial_repeat(test_string[i:]) for i in range(len(test_string))}
>>> results = set()
>>> for x in g:
...     if x and (not results or test_string.count(x) >= min(map(test_string.count, results))):
...         results.add(x)
...
>>> min(results, key=test_string.index)
'123'

I tested these solutions with the test string 'abc123123a' multiplied by n for n in range(100, 10101, 500) to get some timing data. I entered these data into Excel and used its FORECAST() function to estimate the processing time of a 4-million character string at 430 seconds, or about seven minutes.

TigerhawkT3
  • I was interested in seeing how this performed on the strings in the actual dataset, but it looks like it slipped under the radar. Oh well. – TigerhawkT3 Jul 07 '15 at 09:55