repeated substring in R

Question

So, i was going through the question on the following link: Finding repeated substrings with R

Answer posted is very great but i have a doubt, if string is

"111111111111"

then the output coming is

[1] "1111" "111"  "11"

since, "11" and "111" are included in "1111" would not those two are redundant, how can one overcome this? I wand only "1111" Also, what have i understood from that question is that code must output repeated strings of maximal length. Consider another example:

"101110110110110"

expected output is:

"110" "101" "011"

In that example, only "1111" since, "111" and "11" are included in "1111" — user3845968, Aug 09 '14 at 12:50
kindly read the question on that link. I want to find the pattern from any position in any given string such that the pattern repeats for atleast a threshold number of times. — user3845968, Aug 09 '14 at 12:55
Is the threshold in your examples 3? In the second example "1" is repeated 3 times in a row. Why isn't "1" an output? — GSee, Aug 09 '14 at 13:05
Because desire is of repeated string of maximum length. since 3 > 1, that is why. — user3845968, Aug 09 '14 at 13:07
In your latter example, why not "1011" and "0110" ? I fear your question requires much more detailed definition to be answerable. — Carl Witthoft, Aug 09 '14 at 13:21
everything is mentioned on that link. Minimum repetition i.e. threshold(th) is 3. That is why. — user3845968, Aug 09 '14 at 13:25

Casimir et Hippolyte · Answer 1 · 2014-08-09T14:01:35.723

With a classical pattern, the regex engine can't match the same character at the same position more than once.

Example with the string abcde the pattern abc|bcd will give you only one result (abc) but can't give you a second match for bcd since bc has been already a part of the first match result. In other words you can't obtain overlapping match results, since at the second try the position of the regex engine in the string is after the first match (after abc).

To some significant, it is possible with a trick to obtain overlapping results if you put all the pattern in a lookahead assertion (and if you use a capture group to obtain something). The trick use the fact that a lookahead matches nothing, it's only a test from the current regex engine position, however nothing disallow to put a capture group inside.

Example with abcde and (?=(abc|bcd)): the pattern succeeds at the a position and at the b position, but the two whole match results will be empty, but in the first capture group you will obtain abc and bcd.

However, it's always not possible to obtain several results from the same position in the string, because the regex engine advances to the next character each time it obtains a result (or do not obtain anything at all).

If, with the same string, you tests this pattern (?=(abc|abcd)) you will only obtain abc in the capture group. One more time, the same position can't be tested twice.

about your specific example

Here is the test code:

s <- '111111111111'
m <- gregexpr('(?=(.{2,})\\1{2,})', s, perl=TRUE)

unique(mapply(function(x, y) substr(s, x, x+y-1), 
              attr(m[[1]], 'capture.start'), 
              attr(m[[1]], 'capture.length')))

The goal of the pattern is to find substrings with a length greater than 1 character that is repeated more than 1 time, the repeated substrings must be contiguous.

s contains twelve characters. The regex engine will test the whole pattern at each position as explained above. Since the quantifiers are greedy by default, you will obtain the longest result for each positions.

result:

position : capture group : lookahead whole match
character 1: 1111 : 1111 1111 1111
character 2: 111 : 111 111 111
character 3: 111 : 111 111 111
character 4: 111 : 111 111 111
character 5: 11 : 11 11 11 11
character 6: 11 : 11 11 11
character 7: 11 : 11 11 11

The pattern succeeds seven times. The second part of the code only removes duplicates.

Great explanation. But is there no other method to get the desire output? — user3845968, Aug 09 '14 at 14:05
Also, when i am trying s <- '111111111111'; k <- 2 m <- gregexpr('(?=(.{k,})\\1{2,})', s, perl=TRUE) unique(mapply(function(x, y) substr(s, x, x+y-1), attr(m[[1]], 'capture.start'), attr(m[[1]], 'capture.length'))) output is null. Why? — user3845968, Aug 09 '14 at 14:33
@user3845968: Because `k` in the pattern is seen as a literal `k` obviously. — Casimir et Hippolyte, Aug 09 '14 at 16:08
In second example, there are two outputs of same length, But by removing lookahead its not outputting them. Also that 'k' i want its value to be the length of maximum repeated string, obviously. Please help. — user3845968, Aug 09 '14 at 16:16

repeated substring in R

1 Answers1