This is a follow up to A regex to detect periodic strings .
A period
p
of a stringw
is any positive integerp
such thatw[i]=w[i+p]
whenever both sides of this equation are defined. Letper(w)
denote the size of the smallest period ofw
. We say that a stringw
is periodic iffper(w) <= |w|/2
.
So informally a periodic string is just a string that is made up from a another string repeated at least once. The only complication is that at the end of the string we don't require a full copy of the repeated string as long as it is repeated in its entirety at least once.
For, example consider the string x = abcab
. per(abcab) = 3
as x[1] = x[1+3] = a
, x[2]=x[2+3] = b
and there is no smaller period. The string abcab
is therefore not periodic. However, the string ababa
is periodic as per(ababa) = 2
.
As more examples, abcabca
, ababababa
and abcabcabc
are also periodic.
@horcruz, amongst others, gave a very nice regex to recognize a periodic string. It is
\b(\w*)(\w+\1)\2+\b
I would like to find all maximal periodic substrings in a longer string. These are sometimes called runs in the literature.
Formally a substring
w
is a maximal periodic substring if it is periodic and neitherw[i-1] = w[i-1+p]
norw[j+1] = w[j+1-p]
. Informally, the "run" cannot be contained in a larger "run" with the same period.
The four maximal periodic substrings (runs) of string
T = atattatt
areT[4,5] = tt
,T[7,8] = tt
,T[1,4] = atat
,T[2,8] = tattatt
.The string
T = aabaabaaaacaacac
contains the following 7 maximal periodic substrings (runs):T[1,2] = aa
,T[4,5] = aa
,T[7,10] = aaaa
,T[12,13] = aa
,T[13,16] = acac
,T[1,8] = aabaabaa
,T[9,15] = aacaaca
.The string
T = atatbatatb
contains the following three runs. They are:T[1, 4] = atat
,T[6, 9] = atat
andT[1, 10] = atatbatatb
.
Is there a regex (with backreferences) that will capture all maximal periodic substrings?
I don't really mind which flavor of regex but if it makes a difference, anything that the Python module re
supports. However I would even be happy with PCRE if that makes the problem solvable.
(This question is partly copied from https://codegolf.stackexchange.com/questions/84592/compute-the-maximum-number-of-runs-possible-for-as-large-a-string-as-possible . )
Let's extend the regex version to the very powerful https://pypi.python.org/pypi/regex . This supports variable length lookbehinds for example.