A regex for maximal periodic substrings

Question

This is a follow up to A regex to detect periodic strings .

A period p of a string w is any positive integer p such that w[i]=w[i+p] whenever both sides of this equation are defined. Let per(w) denote the size of the smallest period of w . We say that a string w is periodic iff per(w) <= |w|/2.

So informally a periodic string is just a string that is made up from a another string repeated at least once. The only complication is that at the end of the string we don't require a full copy of the repeated string as long as it is repeated in its entirety at least once.

For, example consider the string x = abcab. per(abcab) = 3 as x[1] = x[1+3] = a, x[2]=x[2+3] = b and there is no smaller period. The string abcab is therefore not periodic. However, the string ababa is periodic as per(ababa) = 2.

As more examples, abcabca, ababababa and abcabcabc are also periodic.

@horcruz, amongst others, gave a very nice regex to recognize a periodic string. It is

\b(\w*)(\w+\1)\2+\b

I would like to find all maximal periodic substrings in a longer string. These are sometimes called runs in the literature.

Formally a substring w is a maximal periodic substring if it is periodic and neither w[i-1] = w[i-1+p] nor w[j+1] = w[j+1-p]. Informally, the "run" cannot be contained in a larger "run" with the same period.

The four maximal periodic substrings (runs) of string T = atattatt are T[4,5] = tt, T[7,8] = tt, T[1,4] = atat, T[2,8] = tattatt.
The string T = aabaabaaaacaacac contains the following 7 maximal periodic substrings (runs): T[1,2] = aa, T[4,5] = aa, T[7,10] = aaaa, T[12,13] = aa, T[13,16] = acac, T[1,8] = aabaabaa, T[9,15] = aacaaca.
The string T = atatbatatb contains the following three runs. They are: T[1, 4] = atat, T[6, 9] = atat and T[1, 10] = atatbatatb.

Is there a regex (with backreferences) that will capture all maximal periodic substrings?

I don't really mind which flavor of regex but if it makes a difference, anything that the Python module re supports. However I would even be happy with PCRE if that makes the problem solvable.

(This question is partly copied from https://codegolf.stackexchange.com/questions/84592/compute-the-maximum-number-of-runs-possible-for-as-large-a-string-as-possible . )

Let's extend the regex version to the very powerful https://pypi.python.org/pypi/regex . This supports variable length lookbehinds for example.

"a periodic string is just [...] another string repeated at least twice". I think you meant present at least twice, so repeated at least once. — Aaron, Jul 12 '16 at 10:50
...and shouldn't the example say `T[2,7] = tattat`? (+ @WiktorStribiżew 's) — SamWhan, Jul 12 '16 at 10:58
@WiktorStribiżew `attatt` is not maximal as the t to the left of att is the same in both repeats. That is w[i-1] = w[i-1+p] where p = 3. — Simd, Jul 12 '16 at 10:59

score 2 · Accepted Answer · edited May 23 '17 at 12:09

This should do it, using Python's re module:

(?<=(.))(?=((\w*)(\w*(?!\1)\w\3)\4+))

Fiddle: https://regex101.com/r/aA9uJ0/2

Notes:

You must precede the string being scanned by a dummy character; the # in the fiddle. If that is a problem, it should be possible to work around it in the regex.
Get captured group 2 from each match to get the collection of maximal periodic substrings.
Haven't tried it with longer strings; performance may be an issue.

Explanation:

(?<=(.)) - look-behind to the character preceding the maximal periodic substring; captured as group 1
(?=...) - look-ahead, to ensure overlapping patterns are matched; see How to find overlapping matches with a regexp?
(...) - captures the maximal periodic substring (group 2)
(\w*)(\w*...\w\3)\4+ - @horcruz's regex, as proposed by OP
(?!\1) - negative look-ahead to group 1 to ensure the periodic substring is maximal

As pointed out by @ClasG, the result of my regex may be incomplete. This happens when two runs start at the same offset. Examples:

aabaab has 3 runs: aabaab, aa and aa. The first two runs start at the same offset. My regex will fail to return the shortest one.
atatbatatb has 3 runs: atatbatatb, atat, atat. Same problem here; my regex will only return the first and third run.

This may well be impossible to solve within the regex. As far as I know, there is no regex engine that is capable of returning two different matches that start at the same offset.

I see two possible solutions:

Ignore the missing runs. If I am not mistaken, then they are always duplicates; an identical run will follow within the same encapsulating run.
Do some postprocessing on the result. For every run found (let's call this X), scan earlier runs trying to find one that starts with the same character sequence (let's call this Y). When found (and not already 'used'), add an entry with the same character sequence as X, but the offset of Y.

Sweet, but it'll miss one of the `atat`-runs in `atatbatatb`. — SamWhan, Jul 12 '16 at 12:05
@ClasG You are right. The first `atat` is abandoned because the regex, being greedy, found a longer run `atatbatatb` starting at the same offset. It should yield both, as they have different periods. OP's original regex `(\w*)(\w+\1)\2+` is having the exact same issue. — Ruud Helderman, Jul 12 '16 at 12:25
When you say "OP's original regex (\w*)(\w+\1)\2+ is having the exact same issue" this problem doesn't show up when given a single string to recognise, right? And, do you think there is any way round the problem in your answer? — Simd, Jul 12 '16 at 13:25
@eleanora Fiddle to demonstrate the original regex suffers from the same issue: https://regex101.com/r/dO2eM2/1 Notice there are 2 matches; should be 3. See my edit for a possible solution. — Ruud Helderman, Jul 13 '16 at 07:18

score 0 · Answer 2 · answered Jul 12 '16 at 11:01

0

I think it is not possible. Regular expressions cannot do complex nondeterministic jobs, even with backreferences. You need an algorithm for this.

answered Jul 12 '16 at 11:01

logi-kal

7,107
6
31
43

1

Why is this nondeterministic? – Simd Jul 12 '16 at 11:02
A question - wouldn't `T=atatbatatb` have the runs `T[1,4]`, `T[1,10]` and `T[6,9]`. If I'm correct, then you've got two runs starting in position 1, thus (imo) impossible for a regex to handle, since it can't determine which one it's dealing with. – SamWhan Jul 12 '16 at 11:21
@eleanora I can't demonstrate it mathematically, but for me it's pretty evident. You can't build it up neither with a finite-state machine nor with a pushdown automaton. Even a stupid (not so stupid) language as `aᶻbᶻcᶻ` is acknowledged to be [context-sensitive](https://en.wikipedia.org/wiki/Context-sensitive_grammar) (and so nondeterministic), I think this is even harder. – logi-kal Jul 12 '16 at 11:25
@ClasG That's not the problem: you may have clashes like that even in simple expressions, and in general they are solved by the default greedy behaviour. – logi-kal Jul 12 '16 at 11:44
@eleanora An _informal_ demonstration is the following: you are searching for a string `(?<=rˣa?)sʸ(?=b?pᶻ)`, where `s`, `p` and `r` are substrings structured in a particular way (that surely makes the problem not easier, so we can simplify it), `a` and `b` are optional non-periodic substrings, and `x – logi-kal Jul 12 '16 at 12:41
@horcrux Yes this is not a regular language. The part that is unclear to me is if it sits in the language covered by some common extension to regular expressions such as exist in perl or python. These apparently cover lots of things that are not regular but are context sensitive. See https://nikic.github.io/2012/06/15/The-true-power-of-regular-expressions.html where it says that a modern regex covers all context free languages and "This time I can’t gave you definite answer. They certainly can match some context-sensitive languages, but I don’t know whether they can match all of them." – Simd Jul 12 '16 at 13:23
@eleanora You asked me why it is nondeterministic. And yes, constructs as backreferences or pattern cardinality (`{n,m}`) are examples of extensions to a non-regular language, because the calculator helps us a lot, sometimes even overstepping the limit of computationally feasible calculations. For example `ss` (with `s` in `[ab]+`) is context-sensitive but we just get it with the regex `([ab]+)\1`. But we can't get everywhere. The solution to this problem seems to be intrinsically algorithmic. – logi-kal Jul 12 '16 at 13:43
@horcrux You may well be right although Ruud's answer does look tantalisingly close. You may see an algorithmic version of this question soon on SO :) – Simd Jul 12 '16 at 13:44
Cite your sources, the engine can be either as well as hybrid... and back references do matter but not alone.. .you can combine the matches for further processing. – Jay Jul 12 '16 at 20:22

score 0 · Answer 3 · answered Jul 12 '16 at 20:28

This kind of depends on your input criteria... There is no infinite string of characters.. using back references you will be able to create a suitable representation of the last amount of occurrences of the pattern you wish to match. \ Personally I would define buckets of length of input and then fill them.

I would then use automata to find patterns in the buckets and then finally coalesce them into larger patterns.

It's not how fast the RegEx is going to be in this case it's how fast you are going to be able to recognize a pattern and eliminate the invalid criterion.

A regex for maximal periodic substrings

3 Answers3

Linked