Finding the repeating sub string in a big string

Question

Take for example the following strings

0.714285714285714285714285714285714285714285
0.111111111111111111111111111111111111111111
0.166666666666666666666666666666666666666666

I want to find the sub string that is repeating repetition for each.

714285
1
6

How can I do this in python. Using regex is okay, I tried the following:

import re

testString = "0.714285714285714285714285714285714285714285"
print(re.search(r"(.+)\1", testString).group(1))

This gives me the (wrong) output:

714285714285714285

It should be 7814285

How do I fix this? Is there way to improve my regex or is regex the wrong tool for this job? Maybe python has an awesome built in for this? Is there anyway to do use this with or without regex?

EDIT Before posting an answer check with the test case

0.0022271714922048997772828507795100222717149220489977728285077951002227171492204899777282850779510022

It should return 00222717149220489977728285077951

And if the string is: `1122112211221122` what must be the result? — Casimir et Hippolyte, May 31 '16 at 15:12
Possible duplicate of [Find longest repetitive sequence in a string](http://stackoverflow.com/questions/11090289/find-longest-repetitive-sequence-in-a-string) — leovp, May 31 '16 at 15:14
@CasimiretHippolyte I think `1122`. I am trying to find what is the repeating part in a decimal — user89239213892389, May 31 '16 at 15:14
What about `0.714285714285714285714285714285714285714286`? I. e. what if the last repetition is incomplete or rounded? — Tim Pietzcker, May 31 '16 at 15:14
@TimPietzcker That one should still match `7142857` I won't have a situation where it's rounded but it is very likely one to have parts cut off — user89239213892389, May 31 '16 at 15:16
OK, then what about `0.01234567901234567` (=1/81)? Or can you be sure that the repeating part occurs at least twice in its completeness? — Tim Pietzcker, May 31 '16 at 15:18
@TimPietzcker Actually it should match even if repeating part does not occour completely twice — user89239213892389, May 31 '16 at 15:19
Possible duplicate of [How can I tell if a string repeats itself in Python?](http://stackoverflow.com/q/29481088)? — Bhargav Rao, May 31 '16 at 15:28
Out of curiosity, what is this routine being developed for? What are you ultimately trying to make? — amphetamachine, May 31 '16 at 15:32
I don't think it's possible if you don't even have one proper repetition. How should you be able to tell from `0.121` whether it's going to continue as `0.121212` or as `0.12345`? — Tim Pietzcker, May 31 '16 at 15:34
Can you please explain that what's the logic behind your desire outputs. You can give a complete explanation with examples in comments. — Mazdak, May 31 '16 at 15:35
In the projecteuler's problem they show when the number will repeat using parentheses — Washington Guedes, May 31 '16 at 15:36
This question is getting unclear. After your last edit your last string is contradictory with the firsts since the last has a greedy match but you want to get the non greedy for the first cases. How do you define what is greedy and what is non greedy? For instance, for `0.1666` you want `6` but for `002227` you don't them but want the most greedy part? — Federico Piazza, May 31 '16 at 15:36
@BhargavRao This question is not completely about a string that repeats itself. OP wants to find the repeating part in decimal part which can also be contain different strings — Mazdak, May 31 '16 at 15:36
@Kasravand Thanks,I was unsure and hence I did not Hammer. Perhaps the OP can take a cue or two from the linked post. — Bhargav Rao, May 31 '16 at 15:38
@FedericoPiazza I apologize if the questions has gotten convoluted, it's become much more complicated than I imagined. I'm just trying to get the repeating part of a decimal. — user89239213892389, May 31 '16 at 15:38
@TimPietzcker I think you are right, lets assume the string always has atleast 2 repetitions. — user89239213892389, May 31 '16 at 15:39
@user89239213892389 As I told you, before your question gets close, update your question with a complete explanation. — Mazdak, May 31 '16 at 15:41
You've discarded several answers answering your question. I assume that's because you've failed to give all the criteria for your "search". `(.+?)\1` actually works, only it finds the first repeating part, which in, for example `1122112211221122` *is* `11`, even if there is a longer repeat later. In your other example, you want the "short" repeat `714285714` even though there's a longer (which you've already discovered) - `714285714714285714714285714` - since there are more that six+ repetitions of `714285714`. — SamWhan, May 31 '16 at 15:41
@BhargavRao Yeah, but I thinks we should vote to closing the question as unclear :-). — Mazdak, May 31 '16 at 15:41
So based on your last example, the first example should return `714285714285714285`? — Mazdak, May 31 '16 at 15:51
@Kasravand I don't understand what you mean by that, I don't get where the contradiction is? It should always return the repeating part of the decimal. — user89239213892389, May 31 '16 at 16:07
There are a lot of forms of repetition and you didn't clarified what kind of it is your mean. — Mazdak, May 31 '16 at 16:15
@Kasravand I'm using the same definition of repetition as project euler 26. a `recurring cycle in its decimal fraction part` — user89239213892389, May 31 '16 at 16:16
Give a look in this [post.](http://codegolf.stackexchange.com/a/78938/52956) — Washington Guedes, May 31 '16 at 17:28

Casimir et Hippolyte · Accepted Answer · 2016-06-06T22:03:21.703

You can give a try to this pattern:

(?=(\d+)\1+(.*))(\d+?)\3+\2$

demo

or to obtain the substring as whole match (group 0):

(?=(\d+)\1+(.*))(\d+?)(?=\3+\2$)

What does exactly the pattern?

It returns, for a position in the string, the smallest repeated substring that spans the larger part of the string.

How does it work?

In a lookahead is described the largest repeated substring with a greedy quantifier (i.e. (\d+)), followed by its repetitions \1+, followed by the end of the string captured in group 2.

Then, once the lookahead closed, (\d+?)\3+ searches this time the smallest repeated substring with a non-greedy quantifier but with a condition: after the repetitions, the end of the string must be the same than the one captured in the lookahead.

This ensures that the substring in group 3 can't be sliced into a smaller repeated substring.

Results

The searched substring is in the group 3.

If you use the pattern as it (i.e. non-anchored), the first repeated substring on the left is returned.

Obviously, if you only want a result that starts after the dot you need to anchor the pattern with it:

\.(?=(\d+)\1+(.*))(\d+?)\3+\2$ # immediately after the dot

or

\..*?(?=(\d+)\1+(.*))(\d+?)\3+\2$ # the first after the dot

If you want to research repeated substrings for each positions in the string (for example to find the largest whatever the starting position), you need to enclose all the second part in a lookahead too and to use re.findall:

(?=(\d+)\1+(.*))(?=(\d+?)\3+\2$)

(Then feel free to sort the result list, if you want to obtain the largest string whatever the starting position)

@kelalaka: I don't understand your comment. Can you give an example string and explain clearly what you think is wrong or unexpected (and with which pattern)? — Casimir et Hippolyte, Oct 07 '18 at 12:12
for example, if my string all 0 of any length >1, then the period is expected 1. I didn't get the result. — kelalaka, Oct 07 '18 at 13:56
@kelalaka: The patterns (as asked in the question) don't return a length (or a "period") but a match (the repeating substring), and the patterns return the expected result: 0 (in capturing group 3 for the [first pattern](https://regex101.com/r/UXz2hE/2), the whole match for the [second pattern](https://regex101.com/r/UXz2hE/1)). — Casimir et Hippolyte, Oct 07 '18 at 14:10
yes, I tried to use it to determine the period of my sequences with rho structure. — kelalaka, Oct 07 '18 at 14:20

Pedro Lobito · Answer 2 · 2016-05-31T15:56:27.633

0

To match your last update:

(\d\d)(\d+?)(?=\1)

http://ideone.com/OpqJ9c

Note:
The answer below was valid before your last update. You may have to consider using something more than a regex, which isn't a programming language.

import re
testString = "0.714285714285714285714285714285714285714285"
print(re.search(r"(\d)(\d+?)(?=\1)", testString).group(0)) 
#714285

Demo

(\d)(\d+?)(?=\1)

Match the regex below and capture its match into backreference number 1 «(\d)»
   Match a single character that is a “digit” (ASCII 0–9 only) «\d»
Match the regex below and capture its match into backreference number 2 «(\d+?)»
   Match a single character that is a “digit” (ASCII 0–9 only) «\d+?»
      Between one and unlimited times, as few times as possible, expanding as needed (lazy) «+?»
Assert that the regex below can be matched, starting at this position (positive lookahead) «(?=\1)»
   Match the same text that was most recently matched by capturing group number 1 (case insensitive for A-Z; fail if the group did not participate in the match so far) «\1»

edited May 31 '16 at 15:56

answered May 31 '16 at 15:27

Pedro Lobito

94,083
31
258
268

Thanks for the attempt but as you said this won't work for (1/449). I'm open to other answers that don't use regex. – user89239213892389 May 31 '16 at 16:05
You're welcome, next time please post all the examples you need to match at once, otherwise it's a waste of time for everyone, including you. – Pedro Lobito May 31 '16 at 16:07
Yeah I'm sorry. I should have included that one because it's particularly challenging to match properly. But I can't include them all because there's infinitely many decimals (1/1, 1/2, 1/3, 1/4) it just happens 1/449 is extremely tricky. – user89239213892389 May 31 '16 at 16:09
By the way the new version you posted will fail for `.111111111111111` it should return just `1` – user89239213892389 May 31 '16 at 16:19
You'll need to find other tool than regex. – Pedro Lobito May 31 '16 at 16:21
Sorry if my question was unclear, I'm not looking for a regex only solution, I just want *any* solution – user89239213892389 May 31 '16 at 16:23

Finding the repeating sub string in a big string

2 Answers2

Demo