Detect repetitions in string

Question

I have a simple problem, but can't come with a simple solution :)

Let's say I have a string. I want to detect if there is a repetition in it.

I'd like:

"blablabla" # => (bla, 3)

"rablabla"  # => (bla, 2)

The thing is I don't know what pattern I am searching for (I don't have "bla" as input).

Any idea?

EDIT:
Seeing the comments, I think I should precise a bit more what I have in mind:

In a string, there is either a pattern that is repeted or not.
The repeted pattern can be of any length.

If there is a pattern, it would be repeted over and over again until the end. But the string can end in the middle of the pattern.

Example:

"testblblblblb" # => ("bl",4)

as per your example, the repetited string is of length 3. So you are looking only strings having length 3 ? — Jayy, Jan 31 '12 at 12:54
Indeed :). Whatever the pattern, as soon as it can be found :). — jlengrand, Jan 31 '12 at 12:54
What's the objective function? The length of the match? The number of repetitions? Some combination of the two? — NPE, Jan 31 '12 at 12:57
Do you allow overlapping matches (e.g. the two `aba`s in `ababa`)? — NPE, Jan 31 '12 at 12:57
objective function is this : http://projecteuler.net/problem=26. But I didn't want to spoil the internet. And yeah, S.Lott spelling mistake. . . — jlengrand, Jan 31 '12 at 13:16
This is probably not the best way to solve Euler #26. You'll have to use the decimal module to handle arbitrary-precision numbers (or some equivalent approach) because 1/19 ~ 0.05263157894736842 as a float, so its repeating part doesn't even fit in a float. Admittedly you can bound the length of the repeating part, so you can make it work, but there are "mathier" ways to do it. — DSM, Jan 31 '12 at 14:27
I totally agree with you, DSM. Problem is : How to find a "mathier" way to do this without falling directly on a complete solution? The problem is specialized enough to lead to a direct solution :s — jlengrand, Jan 31 '12 at 15:03
@DSM. Thk you very much. Half a year later, someone upvoting, your comment, a better knowledge of the Python doc and half an hour of free time and problem is finally solved :D. You definitely led me to the solution — jlengrand, Oct 16 '12 at 17:24

score 45 · Accepted Answer · answered Jan 31 '12 at 12:57

45

import re
def repetitions(s):
   r = re.compile(r"(.+?)\1+")
   for match in r.finditer(s):
       yield (match.group(1), len(match.group(0))/len(match.group(1)))

finds all non-overlapping repeating matches, using the shortest possible unit of repetition:

>>> list(repetitions("blablabla"))
[('bla', 3)]
>>> list(repetitions("rablabla"))
[('abl', 2)]
>>> list(repetitions("aaaaa"))
[('a', 5)]
>>> list(repetitions("aaaaablablabla"))
[('a', 5), ('bla', 3)]

answered Jan 31 '12 at 12:57

Tim Pietzcker

328,213
58
503
561

o_O. I should definitely look at that ! – jlengrand Jan 31 '12 at 13:01
I was going to suggest building a graph, but I guess this does the same more efficiently ;) – Marcin Jan 31 '12 at 13:01
3

Isn't this of **O** n! ? I think this is devilish because of the potential computational cost of such a simple-looking construct. – S.Lott Jan 31 '12 at 13:10
2

Some people, when confronted with a problem, think “I know, I'll use regular expressions.” Now they have... a really sweet solution. – dabhaid Jan 31 '12 at 13:22
3

In a string with no repeats, it must locate **all** non-empty substrings. Is that n! ? – S.Lott Jan 31 '12 at 13:37
1

@TimPietzcker I think the worst case is when you nearly get a repeat but not quite. So `s = "abacabadabacabaeabacabadabacabafabacabadabacabaeabacabadabacabagabacabadabacabaeabacabadabacabafabacabadabacabaeabacabadabacabah"` then `for i in range(2,len(s)): print(i, timeit(lambda:list(repetitions(s[:i])), number=1000))` is definitely worse than linear time but I haven't yet figured out what it is. – Duncan Jan 31 '12 at 13:46
1

@S.Lott: Right, that's definitely the worst case and would scale terribly in a long string. Then again, I can't think of a solution that wouldn't run into this problem if any and all parts of the string could possibly be a relevant substring. – Tim Pietzcker Jan 31 '12 at 14:05
3

@TimPietzcker: I think your concern is correct, but too weakly worded. IIRC, there is **no** "solution that wouldn't run into this problem". Stated another way, I think this question is the classic **O** (n!) mistake and there is no sensible algorithm. – S.Lott Jan 31 '12 at 15:12
2

@S.Lott yes, [there is](http://digitool.library.colostate.edu///exlibris/dtl/d3_1/apache_media/L2V4bGlicmlzL2R0bC9kM18xL2FwYWNoZV9tZWRpYS8xNjY2ODE=.pdf) (`O(n log(n))`) – Nicolas Rinaudo Oct 01 '14 at 09:45
2

There is an O(n log n) from 1981 by Crochemore ( http://scholar.google.com/scholar?cluster=8911547273147713204&hl=en&oi=scholarr ) there are more complicated and recent O(n) algorithms where the setup is very heavy (they are based on Lempel-Ziv factorization if I remember) but work quite well on gigabyte strings (think DNA). – Eponymous Dec 02 '16 at 00:22
Fails for ``AssertionError: Error for input ababxab, expected [('ab', 2), ('x', 1), ('ab', 1)] but got [('ab', 2.0)]`` – Tomás Denis Reyes Sánchez Jun 20 '20 at 12:43
@TomásDenisReyesSánchez: I'm only looking for repeated patterns (see the question'S example `"testblblblblb" # => ("bl",4) `), therefore patterns that occur only once will not be counted. Also, if we were to do that, shouldn't the result be `[('ab', 2), ('xab', 1)]`? – Tim Pietzcker Jun 20 '20 at 15:48

Detect repetitions in string

1 Answers1

Linked

Related