-1

So, i was going through the question on the following link: Finding repeated substrings with R

Answer posted is very great but i have a doubt, if string is

"111111111111"

then the output coming is

[1] "1111" "111"  "11"  

since, "11" and "111" are included in "1111" would not those two are redundant, how can one overcome this? I wand only "1111" Also, what have i understood from that question is that code must output repeated strings of maximal length. Consider another example:

"101110110110110"

expected output is:

"110" "101" "011" 
Community
  • 1
  • 1
user3845968
  • 113
  • 7

1 Answers1

1

With a classical pattern, the regex engine can't match the same character at the same position more than once.

Example with the string abcde the pattern abc|bcd will give you only one result (abc) but can't give you a second match for bcd since bc has been already a part of the first match result. In other words you can't obtain overlapping match results, since at the second try the position of the regex engine in the string is after the first match (after abc).

To some significant, it is possible with a trick to obtain overlapping results if you put all the pattern in a lookahead assertion (and if you use a capture group to obtain something). The trick use the fact that a lookahead matches nothing, it's only a test from the current regex engine position, however nothing disallow to put a capture group inside.

Example with abcde and (?=(abc|bcd)): the pattern succeeds at the a position and at the b position, but the two whole match results will be empty, but in the first capture group you will obtain abc and bcd.

However, it's always not possible to obtain several results from the same position in the string, because the regex engine advances to the next character each time it obtains a result (or do not obtain anything at all).

If, with the same string, you tests this pattern (?=(abc|abcd)) you will only obtain abc in the capture group. One more time, the same position can't be tested twice.

about your specific example

Here is the test code:

s <- '111111111111'
m <- gregexpr('(?=(.{2,})\\1{2,})', s, perl=TRUE)

unique(mapply(function(x, y) substr(s, x, x+y-1), 
              attr(m[[1]], 'capture.start'), 
              attr(m[[1]], 'capture.length')))

The goal of the pattern is to find substrings with a length greater than 1 character that is repeated more than 1 time, the repeated substrings must be contiguous.

s contains twelve characters. The regex engine will test the whole pattern at each position as explained above. Since the quantifiers are greedy by default, you will obtain the longest result for each positions.

result:

  • position    : capture group : lookahead whole match
  • character 1: 1111                : 1111 1111 1111
  • character 2: 111                  : 111 111 111
  • character 3: 111                  : 111 111 111
  • character 4: 111                  : 111 111 111
  • character 5: 11                    : 11 11 11 11
  • character 6: 11                    : 11 11 11
  • character 7: 11                    : 11 11 11

The pattern succeeds seven times. The second part of the code only removes duplicates.

Casimir et Hippolyte
  • 88,009
  • 5
  • 94
  • 125
  • Great explanation. But is there no other method to get the desire output? – user3845968 Aug 09 '14 at 14:05
  • @user3845968: You only need to remove the lookahead. – Casimir et Hippolyte Aug 09 '14 at 14:10
  • Also, when i am trying s <- '111111111111'; k <- 2 m <- gregexpr('(?=(.{k,})\\1{2,})', s, perl=TRUE) unique(mapply(function(x, y) substr(s, x, x+y-1), attr(m[[1]], 'capture.start'), attr(m[[1]], 'capture.length'))) output is null. Why? – user3845968 Aug 09 '14 at 14:33
  • @user3845968: Because `k` in the pattern is seen as a literal `k` obviously. – Casimir et Hippolyte Aug 09 '14 at 16:08
  • In second example, there are two outputs of same length, But by removing lookahead its not outputting them. Also that 'k' i want its value to be the length of maximum repeated string, obviously. Please help. – user3845968 Aug 09 '14 at 16:16