29

I have a simple problem, but can't come with a simple solution :)

Let's say I have a string. I want to detect if there is a repetition in it.

I'd like:

"blablabla" # => (bla, 3)

"rablabla"  # => (bla, 2)

The thing is I don't know what pattern I am searching for (I don't have "bla" as input).

Any idea?

EDIT:
Seeing the comments, I think I should precise a bit more what I have in mind:

  • In a string, there is either a pattern that is repeted or not.
  • The repeted pattern can be of any length.

If there is a pattern, it would be repeted over and over again until the end. But the string can end in the middle of the pattern.

Example:

"testblblblblb" # => ("bl",4) 
Rik Poggi
  • 28,332
  • 6
  • 65
  • 82
jlengrand
  • 12,152
  • 14
  • 57
  • 87
  • 3
    Doesn't sound like a very simple problem to me – Hubro Jan 31 '12 at 12:52
  • 14
    I'd say `rablabla` should return `('abl', 2)`, don't you? – Tim Pietzcker Jan 31 '12 at 12:53
  • as per your example, the repetited string is of length 3. So you are looking only strings having length 3 ? – Jayy Jan 31 '12 at 12:54
  • Indeed :). Whatever the pattern, as soon as it can be found :). – jlengrand Jan 31 '12 at 12:54
  • 1
    And I meant simple problem to understand :) – jlengrand Jan 31 '12 at 12:55
  • What's the objective function? The length of the match? The number of repetitions? Some combination of the two? – NPE Jan 31 '12 at 12:57
  • 1
    Do you allow overlapping matches (e.g. the two `aba`s in `ababa`)? – NPE Jan 31 '12 at 12:57
  • objective function is this : http://projecteuler.net/problem=26. But I didn't want to spoil the internet. And yeah, S.Lott spelling mistake. . . – jlengrand Jan 31 '12 at 13:16
  • 4
    This is probably not the best way to solve Euler #26. You'll have to use the decimal module to handle arbitrary-precision numbers (or some equivalent approach) because 1/19 ~ 0.05263157894736842 as a float, so its repeating part doesn't even fit in a float. Admittedly you can bound the length of the repeating part, so you can make it work, but there are "mathier" ways to do it. – DSM Jan 31 '12 at 14:27
  • I totally agree with you, DSM. Problem is : How to find a "mathier" way to do this without falling directly on a complete solution? The problem is specialized enough to lead to a direct solution :s – jlengrand Jan 31 '12 at 15:03
  • @DSM. Thk you very much. Half a year later, someone upvoting, your comment, a better knowledge of the Python doc and half an hour of free time and problem is finally solved :D. You definitely led me to the solution – jlengrand Oct 16 '12 at 17:24

1 Answers1

45
import re
def repetitions(s):
   r = re.compile(r"(.+?)\1+")
   for match in r.finditer(s):
       yield (match.group(1), len(match.group(0))/len(match.group(1)))

finds all non-overlapping repeating matches, using the shortest possible unit of repetition:

>>> list(repetitions("blablabla"))
[('bla', 3)]
>>> list(repetitions("rablabla"))
[('abl', 2)]
>>> list(repetitions("aaaaa"))
[('a', 5)]
>>> list(repetitions("aaaaablablabla"))
[('a', 5), ('bla', 3)]
Tim Pietzcker
  • 328,213
  • 58
  • 503
  • 561
  • o_O. I should definitely look at that ! – jlengrand Jan 31 '12 at 13:01
  • I was going to suggest building a graph, but I guess this does the same more efficiently ;) – Marcin Jan 31 '12 at 13:01
  • 3
    Isn't this of **O** n! ? I think this is devilish because of the potential computational cost of such a simple-looking construct. – S.Lott Jan 31 '12 at 13:10
  • 2
    Some people, when confronted with a problem, think “I know, I'll use regular expressions.” Now they have... a really sweet solution. – dabhaid Jan 31 '12 at 13:22
  • 3
    In a string with no repeats, it must locate **all** non-empty substrings. Is that n! ? – S.Lott Jan 31 '12 at 13:37
  • 1
    @TimPietzcker I think the worst case is when you nearly get a repeat but not quite. So `s = "abacabadabacabaeabacabadabacabafabacabadabacabaeabacabadabacabagabacabadabacabaeabacabadabacabafabacabadabacabaeabacabadabacabah"` then `for i in range(2,len(s)): print(i, timeit(lambda:list(repetitions(s[:i])), number=1000))` is definitely worse than linear time but I haven't yet figured out what it is. – Duncan Jan 31 '12 at 13:46
  • 1
    @S.Lott: Right, that's definitely the worst case and would scale terribly in a long string. Then again, I can't think of a solution that wouldn't run into this problem if any and all parts of the string could possibly be a relevant substring. – Tim Pietzcker Jan 31 '12 at 14:05
  • 3
    @TimPietzcker: I think your concern is correct, but too weakly worded. IIRC, there is **no** "solution that wouldn't run into this problem". Stated another way, I think this question is the classic **O** (n!) mistake and there is no sensible algorithm. – S.Lott Jan 31 '12 at 15:12
  • 2
    @S.Lott yes, [there is](http://digitool.library.colostate.edu///exlibris/dtl/d3_1/apache_media/L2V4bGlicmlzL2R0bC9kM18xL2FwYWNoZV9tZWRpYS8xNjY2ODE=.pdf) (`O(n log(n))`) – Nicolas Rinaudo Oct 01 '14 at 09:45
  • 2
    There is an O(n log n) from 1981 by Crochemore ( http://scholar.google.com/scholar?cluster=8911547273147713204&hl=en&oi=scholarr ) there are more complicated and recent O(n) algorithms where the setup is very heavy (they are based on Lempel-Ziv factorization if I remember) but work quite well on gigabyte strings (think DNA). – Eponymous Dec 02 '16 at 00:22
  • Fails for ``AssertionError: Error for input ababxab, expected [('ab', 2), ('x', 1), ('ab', 1)] but got [('ab', 2.0)]`` – Tomás Denis Reyes Sánchez Jun 20 '20 at 12:43
  • @TomásDenisReyesSánchez: I'm only looking for repeated patterns (see the question'S example `"testblblblblb" # => ("bl",4) `), therefore patterns that occur only once will not be counted. Also, if we were to do that, shouldn't the result be `[('ab', 2), ('xab', 1)]`? – Tim Pietzcker Jun 20 '20 at 15:48