With a classical pattern, the regex engine can't match the same character at the same position more than once.
Example with the string abcde
the pattern abc|bcd
will give you only one result (abc
) but can't give you a second match for bcd
since bc
has been already a part of the first match result. In other words you can't obtain overlapping match results, since at the second try the position of the regex engine in the string is after the first match (after abc
).
To some significant, it is possible with a trick to obtain overlapping results if you put all the pattern in a lookahead assertion (and if you use a capture group to obtain something). The trick use the fact that a lookahead matches nothing, it's only a test from the current regex engine position, however nothing disallow to put a capture group inside.
Example with abcde
and (?=(abc|bcd))
: the pattern succeeds at the a
position and at the b
position, but the two whole match results will be empty, but in the first capture group you will obtain abc
and bcd
.
However, it's always not possible to obtain several results from the same position in the string, because the regex engine advances to the next character each time it obtains a result (or do not obtain anything at all).
If, with the same string, you tests this pattern (?=(abc|abcd))
you will only obtain abc
in the capture group. One more time, the same position can't be tested twice.
about your specific example
Here is the test code:
s <- '111111111111'
m <- gregexpr('(?=(.{2,})\\1{2,})', s, perl=TRUE)
unique(mapply(function(x, y) substr(s, x, x+y-1),
attr(m[[1]], 'capture.start'),
attr(m[[1]], 'capture.length')))
The goal of the pattern is to find substrings with a length greater than 1 character that is repeated more than 1 time, the repeated substrings must be contiguous.
s
contains twelve characters. The regex engine will test the whole pattern at each position as explained above. Since the quantifiers are greedy by default, you will obtain the longest result for each positions.
result:
- position : capture group : lookahead whole match
- character 1: 1111 : 1111 1111 1111
- character 2: 111 : 111 111 111
- character 3: 111 : 111 111 111
- character 4: 111 : 111 111 111
- character 5: 11 : 11 11 11 11
- character 6: 11 : 11 11 11
- character 7: 11 : 11 11 11
The pattern succeeds seven times. The second part of the code only removes duplicates.