362

I'm looking for a way to test whether or not a given string repeats itself for the entire string or not.

Examples:

[
    '0045662100456621004566210045662100456621',             # '00456621'
    '0072992700729927007299270072992700729927',             # '00729927'
    '001443001443001443001443001443001443001443',           # '001443'
    '037037037037037037037037037037037037037037037',        # '037'
    '047619047619047619047619047619047619047619',           # '047619'
    '002457002457002457002457002457002457002457',           # '002457'
    '001221001221001221001221001221001221001221',           # '001221'
    '001230012300123001230012300123001230012300123',        # '00123'
    '0013947001394700139470013947001394700139470013947',    # '0013947'
    '001001001001001001001001001001001001001001001001001',  # '001'
    '001406469760900140646976090014064697609',              # '0014064697609'
]

are strings which repeat themselves, and

[
    '004608294930875576036866359447',
    '00469483568075117370892018779342723',
    '004739336492890995260663507109',
    '001508295625942684766214177978883861236802413273',
    '007518796992481203',
    '0071942446043165467625899280575539568345323741',
    '0434782608695652173913',
    '0344827586206896551724137931',
    '002481389578163771712158808933',
    '002932551319648093841642228739',
    '0035587188612099644128113879',
    '003484320557491289198606271777',
    '00115074798619102416570771',
]

are examples of ones that do not.

The repeating sections of the strings I'm given can be quite long, and the strings themselves can be 500 or more characters, so looping through each character trying to build a pattern then checking the pattern vs the rest of the string seems awful slow. Multiply that by potentially hundreds of strings and I can't see any intuitive solution.

I've looked into regexes a bit and they seem good for when you know what you're looking for, or at least the length of the pattern you're looking for. Unfortunately, I know neither.

How can I tell if a string is repeating itself and if it is, what the shortest repeating subsequence is?

John
  • 15,418
  • 12
  • 44
  • 65
  • You could try using Burrows-Wheeler transform for that. – Ashalynd Apr 06 '15 at 23:12
  • 16
    *looping through each character trying to build a pattern then checking the pattern vs the rest of the string **seems** awful slow* - but is it? – Tim Apr 06 '15 at 23:19
  • 1
    @TimCastelijns Admittedly, I have not actually tested that. – John Apr 06 '15 at 23:27
  • What if the length of the repeating pattern is 1? You should specify the result you want for, e.g., `'aaa'` as input; two of the best answers so far handle this case differently. – Air Apr 06 '15 at 23:37
  • @Air Each of my examples gave the result I want, and a string like `'aaa'` will never be given. – John Apr 06 '15 at 23:39
  • 2
    possible duplicate of [Writing a regex to detect repeat-characters](http://stackoverflow.com/questions/17793962/writing-a-regex-to-detect-repeat-characters) – Avinash Raj Apr 06 '15 at 23:41
  • 2
    @AvinashRaj That's only matching part of a string, not the full thing. – John Apr 06 '15 at 23:48
  • really? use this `r"(\w+)\1$"` regex in re.match function. – Avinash Raj Apr 06 '15 at 23:49
  • 1
    @AvinashRaj Sorry, I meant the *question* was about that, which makes it technically not a duplicate, I think. I will try those answers when I get a chance next – John Apr 07 '15 at 00:49
  • 12
    @AvinashRaj The OP is asking about all possible solutions. The question you link to only accepts *regex* solution. Note that regex may be able to solve the problem but in *much* more time than necessary. For example an optimal solution (i.e. linear time) would use the suffix tree of the text. You just have to find the longest repeated substring and do some checks on the lengths. – Bakuriu Apr 07 '15 at 16:28
  • Any idea when we'll see the results of each method applied to the real dataset? – TigerhawkT3 Apr 07 '15 at 18:42
  • 2
    @TigerhawkT3 The real dataset is far too large and unwieldy, but the examples in the question are part of it, and if you wish, [here's some more](http://paste.ubuntu.com/10765231/). – John Apr 07 '15 at 18:48
  • I didn't mean the raw output, just the performance results (timing and accuracy). The performance tests here are getting pretty different results depending on the strings used. – TigerhawkT3 Apr 07 '15 at 19:03
  • @nhahtdh re: http://stackoverflow.com/posts/29481088/revisions – this question *is* about pattern-matching; whether a string consists of repeating substrings fits the definition "whether a data structure conforms to a certain pattern" in the tag wiki. Although one of the answers uses regular expressions, the rest do not. – Zero Piraeus Apr 08 '15 at 14:22
  • @ZeroPiraeus: I'm not removing the tag based on the fact that it is answered by regex, but due to my understanding of pattern matching in some languages like Haskell or Mathematica. However, after checking Wikipedia, it seems that the definition of pattern matching has a much wider coverage, and the case in the question is indeed a form of pattern matching. – nhahtdh Apr 08 '15 at 14:40
  • 1
    @John, which version did you end up using for your application? What was the execution time? – TigerhawkT3 Apr 09 '15 at 03:03
  • @AvinashRaj: Guess what regex is doing behind the scene? it is parsing the string and stores each element until a posible reapeatition match occurs. And then checks for it. Thats the same what OP statet as too slow. So I don't think its too slow at all, by looping through. But regex doesn't change anything in view of OP at all. – dhein Apr 09 '15 at 07:59
  • @Zaibis then why he accepted a regex solution? – Avinash Raj Apr 09 '15 at 08:02
  • @AvinashRaj: Because he has obvious no clue of low level processing (and probably that each module is broken down just assempler instructions which end in any case in looping arround) and/or of what he asked for. – dhein Apr 09 '15 at 08:05
  • 1
    @John It is of course entirely your call, but I think [David Zhang's answer](http://stackoverflow.com/a/29489919) deserves the accept more than mine, since it's so obviously [superior](http://stackoverflow.com/a/29482936). Not that I *mind* the upvotes I keep getting, you understand, but this is starting to get a little embarrassing ;-) – Zero Piraeus Apr 10 '15 at 12:15
  • @TigerhawkT3 I ended up using Zero's method (David's not being there yet) for a total running time of 0.68541097641 – John Apr 10 '15 at 20:05
  • 2
    I guess this is a good demonstration of the importance of trying whatever quick-and-dirty way you can come up with rather than engaging in premature optimization. :) – TigerhawkT3 Apr 10 '15 at 20:25
  • you don't need all of that why don't you use a simple hash set() this way you make sure there is not the same string twice – Arnaud Aliès Apr 23 '15 at 21:21
  • Have a look at http://stackoverflow.com/questions/34547126/finding-periodic-strings-using-string-functions/34547912#34547912 for a nice algorithm solving (in JavaScript) a very similar problem (the only difference being that only a boolean return value was required) – Walter Tross Jan 06 '16 at 01:09

14 Answers14

597

Here's a concise solution which avoids regular expressions and slow in-Python loops:

def principal_period(s):
    i = (s+s).find(s, 1, -1)
    return None if i == -1 else s[:i]

See the Community Wiki answer started by @davidism for benchmark results. In summary,

David Zhang's solution is the clear winner, outperforming all others by at least 5x for the large example set.

(That answer's words, not mine.)

This is based on the observation that a string is periodic if and only if it is equal to a nontrivial rotation of itself. Kudos to @AleksiTorhamo for realizing that we can then recover the principal period from the index of the first occurrence of s in (s+s)[1:-1], and for informing me of the optional start and end arguments of Python's string.find.

Community
  • 1
  • 1
David Zhang
  • 4,314
  • 1
  • 14
  • 18
  • This would be a good initial check to include in the other methods so that they can immediately return `None` for non-repeating strings. – TigerhawkT3 Apr 07 '15 at 18:11
  • 19
    You can extend this to find the shortest repeating subsequence by using `.find()` or `.index()` instead of `in`, eg. `(s+s).find(s, 1, -1)`. – Aleksi Torhamo Apr 08 '15 at 11:25
  • 11
    Also, I think `(s+s).find(s, 1, -1)` will be (very slightly) faster than `(s+s)[1:-1].find(s)`, at least for larger strings, since the slicing means you have to create another copy of (nearly) the entire string. – Aleksi Torhamo Apr 08 '15 at 13:47
  • @AleksiTorhamo Oh, nice catch. I was unaware `string.find` could take optional `start` and `end` arguments. Thanks for the improvements! – David Zhang Apr 08 '15 at 13:52
  • Just by the way here, but you can just do `if i!=1: return s[:i]`. If a function call doesn't `return` anything, it evaluates to `None` by default. – TigerhawkT3 Apr 08 '15 at 21:19
  • 2
    It kind of reminds me of this answer in terms of sheer elegance: http://stackoverflow.com/a/2553533/2653390 I guess in general, people should pay more attention to strings concatenated to themselves. There is something really beautiful about prefixes, suffixes, and rotations... – Shashank Apr 08 '15 at 21:19
  • 13
    It's like if you take a sin or cos wave from a [periodic function](http://en.wikipedia.org/wiki/Periodic_function) and shift it to the right. Since it's fully periodic, the waves will eventually match up perfectly...The math parallels to this solution are just phenomenal. :) I wish I could give you +∞ upvotes. – Shashank Apr 08 '15 at 21:52
  • @TigerhawkT3 Maybe it's just me, but I prefer to `return None` explicitly if it's part of the intended behavior of the function. As *The Zen of Python* says, "Explicit is better than implicit." – David Zhang Apr 09 '15 at 02:32
  • I consider it to be in line with `if not string` instead of `if string == ''`: taken care of by the language so that we don't have to. Just my opinion, though. – TigerhawkT3 Apr 09 '15 at 02:45
  • 6
    Guido's [recent update](https://mail.python.org/pipermail/python-dev/2015-April/139054.html) to [PEP 8](https://www.python.org/dev/peps/pep-0008/) is relevant here: "Be consistent in return statements. Either all return statements in a function should return an expression, or none of them should. **If any return statement returns an expression,** any return statements where no value is returned should explicitly state this as return None, and **an explicit return statement should be present at the end of the function** (if reachable)." – Zero Piraeus Apr 09 '15 at 03:50
  • Anyone please explain why its (s+s) not s? – ajknzhol Apr 10 '15 at 14:57
  • 3
    @ajkumar25 Because you need to start at an offset of at least 1 (to stop it matching the trivial rotation - itself) and some string contain a string larger than it so it needs to be extended. – Veedrac Apr 10 '15 at 15:01
  • If anyone still thinks that this code is obfuscated then I will post the deobfuscated version of the same :) – haccks Apr 11 '15 at 19:27
  • What does the word "rotation" mean in this context? – Wayne Conrad Apr 11 '15 at 19:35
  • @WayneConrad; Don't lie please. Tell me that along with "rotation" you also didn't get "principle period" and how actually is `i = (s+s).find(s, 1, -1)` doing magic? I will explain :) – haccks Apr 11 '15 at 19:40
  • @haccks Yes, I got how it works. It's incredibly pretty. I've burned it into memory because it's not just a Python trick--it'll work in Ruby just as well. I just don't see anything rotating there... – Wayne Conrad Apr 11 '15 at 19:58
  • @WayneConrad; If you already got how it works then I will say the three line of code is just an implementation of that paragraph. Non-trivial rotation means if you rotate the string `abcabcabcabcabc` by one character three times then you will get the same string after the rotation. – haccks Apr 11 '15 at 20:04
  • 8
    @WayneConrad Take a string, say, `"abcd"`, pop off the character on the right, and stick it back on to the left to obtain `"dabc"`. This procedure is called *rotating a string to the right by 1 character*. Repeat `n` times to rotate a string to the right by `n` characters. Now observe that if we have a string of `k` characters, rotating to the right by any multiple of `k` leaves the string unchanged. A *nontrivial* rotation of a string is one whose character number is not a multiple of the string's length. – David Zhang Apr 11 '15 at 20:15
  • So what about if we have `abcabc`? – Mazdak Apr 15 '15 at 16:20
  • @Kasra `"abcabc"` is periodic precisely because it is equal to a nontrivial rotation of itself (e.g. rotation to the right by 3). – David Zhang Apr 15 '15 at 17:24
  • @DavidZhang I know that, but based on your code it will return None because it will not find s in `bcab`!!!! – Mazdak Apr 15 '15 at 17:31
  • 4
    @Kasra My code doesn't look for occurrences of `"abcabc"` in `"bcab"`, but in `"bcabcabcab"`. Note that it says `(s+s).find`, not `s.find`. – David Zhang Apr 15 '15 at 18:21
  • @DavidZhang Ooops! yeah very simple! – Mazdak Apr 15 '15 at 18:24
  • If I want to check if there is a point in a number where a part repeats until the end of the string sth. like 0.123413135351515151... where the 51 repeats. Do I have to make a loop and run this script for every substring `s[l:]` or is there a faster solution? – Wikunia Apr 23 '15 at 18:51
  • Knuth–Morris–Pratt algorithm, http://en.wikipedia.org/wiki/Knuth%E2%80%93Morris%E2%80%93Pratt_algorithm, complexity O(n) – The Demz Apr 23 '15 at 20:37
  • I think `i = s.find(s[:len(s)/2], 1, -1)` could be faster since you have smaller string to match – Joon Hong Apr 24 '15 at 04:39
  • Hi. Isn't a suffix tree algorithm a faster implementation here? – deostroll Apr 24 '15 at 07:17
  • I have adapted this algorithm to JavaScript: http://stackoverflow.com/questions/34547126/… BTW, since I had to replace `s+s` with `s+s[:something]` (because in JavaScript there is no equivalent of the 3rd parameter of `find`) I optimized it to `s+s[:len(s)//2]`. It would be interesting to check whether `(s+s[:len(s)//2]).find(s, 1)` is faster than `(s+s).find(s, 1, -1)`. – Walter Tross Jan 07 '16 at 17:10
  • 3
    I just came here to say that this is one of the most fascinating things i've ever encountered in Computer Science. It may be because my programming days are long behind this Engineering Director. but what a cool observation! Thank you for giving me this endless fascination. – Shaun Feb 11 '20 at 19:08
185

Here's a solution using regular expressions.

import re

REPEATER = re.compile(r"(.+?)\1+$")

def repeated(s):
    match = REPEATER.match(s)
    return match.group(1) if match else None

Iterating over the examples in the question:

examples = [
    '0045662100456621004566210045662100456621',
    '0072992700729927007299270072992700729927',
    '001443001443001443001443001443001443001443',
    '037037037037037037037037037037037037037037037',
    '047619047619047619047619047619047619047619',
    '002457002457002457002457002457002457002457',
    '001221001221001221001221001221001221001221',
    '001230012300123001230012300123001230012300123',
    '0013947001394700139470013947001394700139470013947',
    '001001001001001001001001001001001001001001001001001',
    '001406469760900140646976090014064697609',
    '004608294930875576036866359447',
    '00469483568075117370892018779342723',
    '004739336492890995260663507109',
    '001508295625942684766214177978883861236802413273',
    '007518796992481203',
    '0071942446043165467625899280575539568345323741',
    '0434782608695652173913',
    '0344827586206896551724137931',
    '002481389578163771712158808933',
    '002932551319648093841642228739',
    '0035587188612099644128113879',
    '003484320557491289198606271777',
    '00115074798619102416570771',
]

for e in examples:
    sub = repeated(e)
    if sub:
        print("%r: %r" % (e, sub))
    else:
        print("%r does not repeat." % e)

... produces this output:

'0045662100456621004566210045662100456621': '00456621'
'0072992700729927007299270072992700729927': '00729927'
'001443001443001443001443001443001443001443': '001443'
'037037037037037037037037037037037037037037037': '037'
'047619047619047619047619047619047619047619': '047619'
'002457002457002457002457002457002457002457': '002457'
'001221001221001221001221001221001221001221': '001221'
'001230012300123001230012300123001230012300123': '00123'
'0013947001394700139470013947001394700139470013947': '0013947'
'001001001001001001001001001001001001001001001001001': '001'
'001406469760900140646976090014064697609': '0014064697609'
'004608294930875576036866359447' does not repeat.
'00469483568075117370892018779342723' does not repeat.
'004739336492890995260663507109' does not repeat.
'001508295625942684766214177978883861236802413273' does not repeat.
'007518796992481203' does not repeat.
'0071942446043165467625899280575539568345323741' does not repeat.
'0434782608695652173913' does not repeat.
'0344827586206896551724137931' does not repeat.
'002481389578163771712158808933' does not repeat.
'002932551319648093841642228739' does not repeat.
'0035587188612099644128113879' does not repeat.
'003484320557491289198606271777' does not repeat.
'00115074798619102416570771' does not repeat.

The regular expression (.+?)\1+$ is divided into three parts:

  1. (.+?) is a matching group containing at least one (but as few as possible) of any character (because +? is non-greedy).

  2. \1+ checks for at least one repetition of the matching group in the first part.

  3. $ checks for the end of the string, to ensure that there's no extra, non-repeating content after the repeated substrings (and using re.match() ensures that there's no non-repeating text before the repeated substrings).

In Python 3.4 and later, you could drop the $ and use re.fullmatch() instead, or (in any Python at least as far back as 2.3) go the other way and use re.search() with the regex ^(.+?)\1+$, all of which are more down to personal taste than anything else.

Zero Piraeus
  • 56,143
  • 27
  • 150
  • 160
  • 7
    I have no idea why this concise two liner is not the highest-voted answer. The other answers are not bad, but this one is far better. (It may use the frequently denigrated *regular expression*, but I can inspect this far easier than the other much longer answers, which are full of nested blocks, potential typos, off-by-one errors, etc.) Ah well, worse is better I suppose. – Paul Draper Apr 07 '15 at 20:46
  • 10
    I think there are two main reasons for that: 1) some programmers like math more than they like regexes, and 2) since varying the length and nature of the input strings makes different answers pull ahead on performance, the super-long edge case strings (which may not even appear in the real data) make this solution appear suboptimal. – TigerhawkT3 Apr 07 '15 at 21:26
  • sometimes you encounter great prejudice towards regular expressions. Ive had 2 managers that forbade the use of regular expressions because they had heard regular expressons are the wrong tool for the job. Except offcourse then they proceeded by asking me to implement an regexp engine – joojaa Apr 08 '15 at 14:50
  • 1
    @PaulDraper: Guess what regex is doing behind the scene? it is parsing the string and stores each element until a posible reapeatition match occurs. Thats the same what OP statet as too slow. so just because it is a 2 liner there isn't any performance win. – dhein Apr 09 '15 at 07:54
  • 2
    @Zaibis, I'd normally agree, but this is both the shortest and fastest solution (http://stackoverflow.com/a/29482936/1212596)....Except for David's, which was posted after I made that comment. I actually like David's approach more (clever!). – Paul Draper Apr 09 '15 at 08:58
  • The fact that this popular solution has extremely bad performance for some edge cases is an indication that it is not a good answer. Regexes are a great tool and should be used when they are the right tool. – jwg Apr 10 '15 at 11:42
  • This one also works great if you're trying to remove duplicate lines (including multi-line sequences). Just add the MULTILINE and DOTALL flags. It will skip duplicate sequences within a single line. The $ is critical! – Jonah Sep 04 '20 at 16:15
91

You can make the observation that for a string to be considered repeating, its length must be divisible by the length of its repeated sequence. Given that, here is a solution that generates divisors of the length from 1 to n / 2 inclusive, divides the original string into substrings with the length of the divisors, and tests the equality of the result set:

from math import sqrt, floor

def divquot(n):
    if n > 1:
        yield 1, n
    swapped = []
    for d in range(2, int(floor(sqrt(n))) + 1):
        q, r = divmod(n, d)
        if r == 0:
            yield d, q
            swapped.append((q, d))
    while swapped:
        yield swapped.pop()

def repeats(s):
    n = len(s)
    for d, q in divquot(n):
        sl = s[0:d]
        if sl * q == s:
            return sl
    return None

EDIT: In Python 3, the / operator has changed to do float division by default. To get the int division from Python 2, you can use the // operator instead. Thank you to @TigerhawkT3 for bringing this to my attention.

The // operator performs integer division in both Python 2 and Python 3, so I've updated the answer to support both versions. The part where we test to see if all the substrings are equal is now a short-circuiting operation using all and a generator expression.

UPDATE: In response to a change in the original question, the code has now been updated to return the smallest repeating substring if it exists and None if it does not. @godlygeek has suggested using divmod to reduce the number of iterations on the divisors generator, and the code has been updated to match that as well. It now returns all positive divisors of n in ascending order, exclusive of n itself.

Further update for high performance: After multiple tests, I've come to the conclusion that simply testing for string equality has the best performance out of any slicing or iterator solution in Python. Thus, I've taken a leaf out of @TigerhawkT3 's book and updated my solution. It's now over 6x as fast as before, noticably faster than Tigerhawk's solution but slower than David's.

Shashank
  • 13,713
  • 5
  • 37
  • 63
  • 3
    This solution is amazing. You could change the divisors method to follow the produce-consumer pattern. So that it yield's results as they are found. Will be a shame if this isn't the highest answer. Everything else is brute force. – JustinDanielson Apr 06 '15 at 23:33
  • 3
    @JustinDanielson It returns a generator object created from a generator expression, which is an implicit producer :) It will lazy-evaluate the divisors. – Shashank Apr 06 '15 at 23:34
  • 1
    Ohh. I didn't know that. Well, even better then. :D I understand why you want to avoid sqrt, but you could use n/2 as the upper bound for the divisor range. – JustinDanielson Apr 06 '15 at 23:36
  • 1
    @JustinDanielson Thanks for the suggestion, the range upper bound is now `(n/2)` inclusive. – Shashank Apr 06 '15 at 23:42
  • 1
    Should `n / 2` in `divisors()` be `n // 2`? – TigerhawkT3 Apr 06 '15 at 23:55
  • @TigerhawkT3 In Python 2, it should be `/`, and in Python 3, it should be `//` since we want integer style division. Thanks for bringing it to my attention, I've made an edit. – Shashank Apr 07 '15 at 00:01
  • @TigerhawkT3 Yes, but the `range` needs to include that number so the second argument to `range` should be `int(n / 2) + 1`. – Shashank Apr 07 '15 at 00:04
  • Yes; I was only referring to changing `n / 2` to `int(n / 2)` without removing or changing anything around it. But, if `//` produces integer division in both 2 and 3, then that's probably better than casting to `int()`. – TigerhawkT3 Apr 07 '15 at 00:12
  • @TigerhawkT3 You are right, to get more compatibility between versions, I should use `//`. I've updated the answer now to make it a bit faster. – Shashank Apr 07 '15 at 00:16
  • It seems to have changed back to `n / 2` in the latest update? – TigerhawkT3 Apr 07 '15 at 00:41
  • @TigerhawkT3 Ugh, I am dumb :( I didn't realize that I could use `all` instead of `any`, okay now I think it's really finalized...And thanks for spotting that. :) – Shashank Apr 07 '15 at 00:48
  • 1
    I didn't even know `all` and `any` existed until a couple weeks ago. You should've seen the wheels I reinvented. – TigerhawkT3 Apr 07 '15 at 00:51
  • Why would a length of 1 not handled? It's just a reoating sequence of length one. Nothing really special. Am I missing anything? – Tarik Apr 07 '15 at 02:14
  • This can actually be made faster by `math.floor(math.sqrt(n))` instead of `n // 2` unless I'm missing something. – Ryan Dougherty Apr 07 '15 at 16:41
  • 1
    @Ryan I can't prove it theoretically but I can offer a counterexample as to why `math.sqrt` wouldn't work. A string of length 12 can be a repeated sequence of length 4 or a repeated sequence of length 6, but the square root of 12 is ~3.46. – Shashank Apr 07 '15 at 16:45
  • @Shashank Thanks, I was thinking of mathematical divisors - this is a different use-case. It may be useful to edit the name of the function as to not confuse anyone else. – Ryan Dougherty Apr 07 '15 at 16:48
  • @Ryan But...the generator produces a subset of all mathematical divisors of any given `n`, specifically: all positive divisors of a given `n` except for `1` and `n`. Do you have a suggestion for a better name? `divisors` was the best one I could think of, even if it is a little unspecific. – Shashank Apr 07 '15 at 16:53
  • @Shashank Checking the numbers between sqrt(n) and n/2 is duplicating work, because each of them corresponds to a number between 2 and sqrt(n) that you already found. For your example, 12 == 6 * 2 - but you already found out that 2 divides 6, so you could have yielded both 2 and 6 at that point, and not needed to redo the division at 6. – godlygeek Apr 07 '15 at 18:09
  • I haven't checked whether it's actually any faster, but here's a version of "divisors" that stops the range at sqrt(n): `divisors = lambda n: (d for q1, (q2, r) in ((i, divmod(n,i)) for i in range(2, int(n**.5+1))) if r == 0 for d in set((q1,q2)))` – godlygeek Apr 07 '15 at 18:53
  • @godlygeek The code has been updated with your suggestion. Also, it will now always find the smallest repeating substring first. :) – Shashank Apr 07 '15 at 20:13
  • 1
    @Shashank I've updated the benchmark answer to reflect your improvements. – Zero Piraeus Apr 24 '15 at 18:31
89

Here are some benchmarks for the various answers to this question. There were some surprising results, including wildly different performance depending on the string being tested.

Some functions were modified to work with Python 3 (mainly by replacing / with // to ensure integer division). If you see something wrong, want to add your function, or want to add another test string, ping @ZeroPiraeus in the Python chatroom.

In summary: there's about a 50x difference between the best- and worst-performing solutions for the large set of example data supplied by OP here (via this comment). David Zhang's solution is the clear winner, outperforming all others by around 5x for the large example set.

A couple of the answers are very slow in extremely large "no match" cases. Otherwise, the functions seem to be equally matched or clear winners depending on the test.

Here are the results, including plots made using matplotlib and seaborn to show the different distributions:


Corpus 1 (supplied examples - small set)

mean performance:
 0.0003  david_zhang
 0.0009  zero
 0.0013  antti
 0.0013  tigerhawk_2
 0.0015  carpetpython
 0.0029  tigerhawk_1
 0.0031  davidism
 0.0035  saksham
 0.0046  shashank
 0.0052  riad
 0.0056  piotr

median performance:
 0.0003  david_zhang
 0.0008  zero
 0.0013  antti
 0.0013  tigerhawk_2
 0.0014  carpetpython
 0.0027  tigerhawk_1
 0.0031  davidism
 0.0038  saksham
 0.0044  shashank
 0.0054  riad
 0.0058  piotr

Corpus 1 graph


Corpus 2 (supplied examples - large set)

mean performance:
 0.0006  david_zhang
 0.0036  tigerhawk_2
 0.0036  antti
 0.0037  zero
 0.0039  carpetpython
 0.0052  shashank
 0.0056  piotr
 0.0066  davidism
 0.0120  tigerhawk_1
 0.0177  riad
 0.0283  saksham

median performance:
 0.0004  david_zhang
 0.0018  zero
 0.0022  tigerhawk_2
 0.0022  antti
 0.0024  carpetpython
 0.0043  davidism
 0.0049  shashank
 0.0055  piotr
 0.0061  tigerhawk_1
 0.0077  riad
 0.0109  saksham

Corpus 1 graph


Corpus 3 (edge cases)

mean performance:
 0.0123  shashank
 0.0375  david_zhang
 0.0376  piotr
 0.0394  carpetpython
 0.0479  antti
 0.0488  tigerhawk_2
 0.2269  tigerhawk_1
 0.2336  davidism
 0.7239  saksham
 3.6265  zero
 6.0111  riad

median performance:
 0.0107  tigerhawk_2
 0.0108  antti
 0.0109  carpetpython
 0.0135  david_zhang
 0.0137  tigerhawk_1
 0.0150  shashank
 0.0229  saksham
 0.0255  piotr
 0.0721  davidism
 0.1080  zero
 1.8539  riad

Corpus 3 graph


The tests and raw results are available here.

Community
  • 1
  • 1
davidism
  • 121,510
  • 29
  • 395
  • 339
37

Non-regex solution:

def repeat(string):
    for i in range(1, len(string)//2+1):
        if not len(string)%len(string[0:i]) and string[0:i]*(len(string)//len(string[0:i])) == string:
            return string[0:i]

Faster non-regex solution, thanks to @ThatWeirdo (see comments):

def repeat(string):
    l = len(string)
    for i in range(1, len(string)//2+1):
        if l%i: continue
        s = string[0:i]
        if s*(l//i) == string:
            return s

The above solution is very rarely slower than the original by a few percent, but it's usually a good bit faster - sometimes a whole lot faster. It's still not faster than davidism's for longer strings, and zero's regex solution is superior for short strings. It comes out to the fastest (according to davidism's test on github - see his answer) with strings of about 1000-1500 characters. Regardless, it's reliably second-fastest (or better) in all cases I tested. Thanks, ThatWeirdo.

Test:

print(repeat('009009009'))
print(repeat('254725472547'))
print(repeat('abcdeabcdeabcdeabcde'))
print(repeat('abcdefg'))
print(repeat('09099099909999'))
print(repeat('02589675192'))

Results:

009
2547
abcde
None
None
None
TigerhawkT3
  • 48,464
  • 6
  • 60
  • 97
  • Isn't this a brute force solution? – JustinDanielson Apr 06 '15 at 23:23
  • 7
    @JustinDanielson Yes. But a solution none the less. – Sinkingpoint Apr 06 '15 at 23:37
  • 3
    I'm seeing about 1e-5 to 3e-5 seconds for short strings, 3e-5 to 4e-5 seconds for successful long (1000-character) strings, and a little bit under a millisecond for unsuccessful long strings (worst case). So a thousand 1000-character strings would take about a second. Compared to the math answer, this finds matches 10 times faster, but takes 3 times longer to fail. – TigerhawkT3 Apr 06 '15 at 23:54
  • `repeat('aa')` returns `None` – Tom Cornebize Apr 07 '15 at 09:52
  • 2
    `len(string[0:i])` is always equal to `i` (in this case at least). Replacing these, and also saving `len(string)` and `string[0:i]` in variables might speed things up. Also IMO this is a great solution, awesome ;) – assembly_wizard Apr 07 '15 at 20:25
  • For some reason, I was assigning `l` inside the loop even though it stays the same throughout the duration of the function. I have fixed that. Now it's about 20% faster, on average. – TigerhawkT3 Apr 07 '15 at 23:54
  • @TigerhawkT3 your non-regex solution is the almost the same as mine, but you copy the candidate string needlessly in case of "no match is possible" – Antti Haapala -- Слава Україні Apr 08 '15 at 20:29
  • I must admit, this is one of two times that I've tried to optimize for performance in the last year. I'm just happy that this simple, clear algorithm ended up working so well. :) The new edit with the `continue` statement should be about 15% faster than previously. – TigerhawkT3 Apr 08 '15 at 20:45
  • I think that after I put a `continue` statement in here, and if you fix your answer to go to `length // 2 - 1` just like I had to (to handle strings that repeat exactly one), we will have precisely identical solutions. Pretty cool. – TigerhawkT3 Apr 08 '15 at 21:14
24

First, halve the string as long as it's a "2 part" duplicate. This reduces the search space if there are an even number of repeats. Then, working forwards to find the smallest repeating string, check if splitting the full string by increasingly larger sub-string results in only empty values. Only sub-strings up to length // 2 need to be tested since anything over that would have no repeats.

def shortest_repeat(orig_value):
    if not orig_value:
        return None

    value = orig_value

    while True:
        len_half = len(value) // 2
        first_half = value[:len_half]

        if first_half != value[len_half:]:
            break

        value = first_half

    len_value = len(value)
    split = value.split

    for i in (i for i in range(1, len_value // 2) if len_value % i == 0):
        if not any(split(value[:i])):
            return value[:i]

    return value if value != orig_value else None

This returns the shortest match or None if there is no match.

davidism
  • 121,510
  • 29
  • 395
  • 339
16

This version tries only those candidate sequence lengths that are factors of the string length; and uses the * operator to build a full-length string from the candidate sequence:

def get_shortest_repeat(string):
    length = len(string)
    for i in range(1, length // 2 + 1):
        if length % i:  # skip non-factors early
            continue

        candidate = string[:i]
        if string == candidate * (length // i):
            return candidate

    return None

Thanks to TigerhawkT3 for noticing that length // 2 without + 1 would fail to match the abab case.

  • This solution is indeed virtually identical to my optimized one. I see that you have a `range` limit of `length//2`, just like I did - you have to change that to `length//2+1` if you want to catch strings that are exactly doubled (e.g. `'aabaab'`). – TigerhawkT3 Apr 08 '15 at 20:53
  • And now they're identical! \o/ I need to pay more attention to optimization in the future, but I'm glad the algorithm itself was sound. – TigerhawkT3 Apr 09 '15 at 07:06
16

The problem may also be solved in O(n) in worst case with prefix function.

Note, it may be slower in general case(UPD: and is much slower) than other solutions which depend on number of divisors of n, but usually find fails sooner, I think one of bad cases for them will be aaa....aab, where there are n - 1 = 2 * 3 * 5 * 7 ... *p_n - 1 a's

First of all you need to calculate prefix function

def prefix_function(s):
    n = len(s)
    pi = [0] * n
    for i in xrange(1, n):
        j = pi[i - 1]
        while(j > 0 and s[i] != s[j]):
            j = pi[j - 1]
        if (s[i] == s[j]):
            j += 1
        pi[i] = j;
    return pi

then either there's no answer or the shortest period is

k = len(s) - prefix_function(s[-1])

and you just have to check if k != n and n % k == 0 (if k != n and n % k == 0 then answer is s[:k], else there's no answer

You may check the proof here (in Russian, but online translator will probably do the trick)

def riad(s):
    n = len(s)
    pi = [0] * n
    for i in xrange(1, n):
        j = pi[i - 1]
        while(j > 0 and s[i] != s[j]):
            j = pi[j - 1]
        if (s[i] == s[j]):
            j += 1
        pi[i] = j;
    k = n - pi[-1]
    return s[:k] if (n != k and n % k == 0) else None
RiaD
  • 46,822
  • 11
  • 79
  • 123
  • Your `prefix_function()` is not valid Python: you have missing colons on your `while` and `if` statements, and `&&` instead of `and`. After fixing those, it fails with `UnboundLocalError: local variable 'i' referenced before assignment` because of the line `for i in range(i, n):`. – Zero Piraeus Apr 08 '15 at 16:40
  • Thanks :-) If you can put together a function that uses your `prefix_function()` to return similar results to the other answers – either the shortest substring or `None` – I'll include it in a revised benchmark I'm putting together. – Zero Piraeus Apr 08 '15 at 16:47
  • @ZeroPiraeus, It works fine actually, I just called it in a wrong way – RiaD Apr 08 '15 at 17:01
15

Here's a straight forward solution, without regexes.

For substrings of s starting from zeroth index, of lengths 1 through len(s), check if that substring, substr is the repeated pattern. This check can be performed by concatenating substr with itself ratio times, such that the length of the string thus formed is equal to the length of s. Hence ratio=len(s)/len(substr).

Return when first such substring is found. This would provide the smallest possible substring, if one exists.

def check_repeat(s):
    for i in range(1, len(s)):
        substr = s[:i]
        ratio = len(s)/len(substr)
        if substr * ratio == s:
            print 'Repeating on "%s"' % substr
            return
    print 'Non repeating'

>>> check_repeat('254725472547')
Repeating on "2547"
>>> check_repeat('abcdeabcdeabcdeabcde')
Repeating on "abcde"
Saksham Varma
  • 2,122
  • 13
  • 15
  • Now that I look at this one carefully, it seems to be nearly identical to my originally posted (before any edits) solution, with the only differences being saving the length and the substring. I guess I had a pretty good algorithm. :P – TigerhawkT3 Apr 10 '15 at 20:16
9

I started with more than eight solutions to this problem. Some were bases on regex (match, findall, split), some of string slicing and testing, and some with string methods (find, count, split). Each had benefits in code clarity, code size, speed and memory consumption. I was going to post my answer here when I noticed that execution speed was ranked as important, so I did more testing and improvement to arrive at this:

def repeating(s):
    size = len(s)
    incr = size % 2 + 1
    for n in xrange(1, size//2+1, incr):
        if size % n == 0:
            if s[:n] * (size//n) == s:
                return s[:n]

This answer seems similar to a few other answers here, but it has a few speed optimisations others have not used:

  • xrange is a little faster in this application,
  • if an input string is an odd length, do not check any even length substrings,
  • by using s[:n] directly, we avoid creating a variable in each loop.

I would be interested to see how this performs in the standard tests with common hardware. I believe it will be well short of David Zhang's excellent algorithm in most tests, but should be quite fast otherwise.

I found this problem to be very counter-intuitive. The solutions I thought would be fast were slow. The solutions that looked slow were fast! It seems that Python's string creation with the multiply operator and string comparisons are highly optimised.

Logic Knight
  • 289
  • 4
  • 10
  • [Not bad at all :-)](http://stackoverflow.com/a/29482936) The benchmark runs on Python 3.4 (partly because OP doesn't specify a version and that's what everyone *should* be using, and partly because it uses the new [`statistics`](https://docs.python.org/3/library/statistics.html) module), so I had to change your `/`s to `//`s and replace `xrange()` with `range()` (which behaves like 2.x's `xrange()` in 3.x). – Zero Piraeus Apr 09 '15 at 16:38
  • Here are the [revisions](https://bitbucket.org/snippets/schesis/nMnR/revisions/) to the benchmark, so you can review my changes, by the way. – Zero Piraeus Apr 09 '15 at 16:42
  • Thanks Zero. That was fast. The results were slightly down on my predictions. I suspect the techniques I used for speed in Python 2.7 are not very effective in Python 3.4. Oh, well - a fun and educational exercise. – Logic Knight Apr 09 '15 at 16:50
  • `//` in 3.x is integer division (just like the 2.x behaviour of `/`), while 3.x's `/` is float division (which I'm sure would be much slower even if it didn't break your solution by causing an attempt to use a float as an index). As mentioned, 3.x's `range()` is the same thing as 2.x's `xrange()`; no equivalent of 2.x's `range()` exists in 3.x. So I don't think that's the cause of any discrepancy between the benchmark and any timings you made. It's probably just that 3.x is slower overall than 2.x (or maybe your machine is faster than mine). – Zero Piraeus Apr 09 '15 at 16:55
  • When I get some time, I shall have a close look at the run-time differences between Python 2 and Python 3. – Logic Knight Apr 09 '15 at 17:14
  • This is identical to my and @Antti's algorithm. It adds the optimization of giving the `range` a step argument, and omits the optimization of saving the substring instead of calculating it twice. It results in pretty similar performance. – TigerhawkT3 Apr 09 '15 at 18:48
  • *Local* variables aren't that slow in CPython 3, they will be numbered; the number of locals within a function `func` is `func.__code__.co_nlocals`; CPython uses `LOAD_FAST` opcode to load a local variable from storage, which is possibly even faster than stack operations. Furthermore my code, by using the variable, avoids slicing the string twice. – Antti Haapala -- Слава Україні Apr 15 '15 at 10:22
  • @AnttiHaapala, thanks. I will remember this when writing faster Python. – Logic Knight Apr 15 '15 at 15:06
2

This function runs very quickly (tested and it's over 3 times faster than fastest solution here on strings with over 100k characters and the difference gets bigger the longer the repeating pattern is). It tries to minimise the number of comparisons needed to get the answer:

def repeats(string):
    n = len(string)
    tried = set([])
    best = None
    nums = [i for i in  xrange(2, int(n**0.5) + 1) if n % i == 0]
    nums = [n/i for i in nums if n/i!=i] + list(reversed(nums)) + [1]
    for s in nums:
        if all(t%s for t in tried):
            print 'Trying repeating string of length:', s
            if string[:s]*(n/s)==string:
                best = s
            else:
                tried.add(s)
    if best:
        return string[:best]

Note that for example for string of length 8 it checks only fragment of size 4 and it does not have to test further because pattern of length 1 or 2 would result in repeating pattern of length 4:

>>> repeats('12345678')
Trying repeating string of length: 4
None

# for this one we need only 2 checks 
>>> repeats('1234567812345678')
Trying repeating string of length: 8
Trying repeating string of length: 4
'12345678'
Piotr Dabkowski
  • 5,661
  • 5
  • 38
  • 47
  • Thank you :) I did not optimise it much though. I just wanted to present different approach because other answers are focusing on finding the the pattern and my approach focuses on proving that there is no pattern :) Therefore for random strings my algorithm should run much faster. – Piotr Dabkowski Apr 24 '15 at 21:24
1

In David Zhang's answer if we have some sort of circular buffer this will not work: principal_period('6210045662100456621004566210045662100456621') due to the starting 621, where I would have liked it to spit out: 00456621.

Extending his solution we can use the following:

def principal_period(s):
    for j in range(int(len(s)/2)):
        idx = (s[j:]+s[j:]).find(s[j:], 1, -1)
        if idx != -1:
            # Make sure that the first substring is part of pattern
            if s[:j] == s[j:][:idx][-j:]:
                break

    return None if idx == -1 else s[j:][:idx]

principal_period('6210045662100456621004566210045662100456621')
>>> '00456621'
sachinruk
  • 9,571
  • 12
  • 55
  • 86
0

If you only want to know if a string repeats itself in another string, then this is the best (in my opinion) and shortest snippet for this:

def rep(repeat,fullstring):
    return len(fullstring.split(repeat)) > 2

#rep('test', 'test -others-') - no
#rep('hello', 'hello world') - yes
Blaze612 YT
  • 594
  • 7
  • 13
0

Here's a simple approach towards this question:

def is_repeated(s):
    n = len(s)
    for i in range(1, n//2 + 1):
        if n % i == 0:
            repeated = s[:i] * (n // i)
            if repeated == s:
                return True
    return False
repeated_strings_list = []
for i in list1:
  repeated_strings_list.append(is_repeated(i))

So, this function basically devides the string in half and checks if that half substring is repeating itself in the original string or not. It continues to do so until it reaches the limit, or if the sub-string is repeated.

Here's the example you've provided:

 list1 = [
        '0045662100456621004566210045662100456621',             
        '0072992700729927007299270072992700729927',             
        '001443001443001443001443001443001443001443',           
        '037037037037037037037037037037037037037037037',        
        '047619047619047619047619047619047619047619',           
        '002457002457002457002457002457002457002457',           
        '001221001221001221001221001221001221001221',           
        '001230012300123001230012300123001230012300123',        
        '0013947001394700139470013947001394700139470013947',    
        '001001001001001001001001001001001001001001001001001',  
        '001406469760900140646976090014064697609',              
    ]

And the output you will get with this approach:

 [True, True, True, True, True, True, True, True, True, True, True]
MADHAV G
  • 1
  • 1