Regex matches consume characters
To elaborate on my comment, see this Python answer:
Except for zero-length assertions, characters in the input are always consumed during matching. If you ever want to capture a certain character in the input string more than once, you will need a zero-length assertion in the regex.
What happens in your case
We can step through a (simplified) version of regex matching your string, "55566556", with your pattern, "55":
- Match 1: Characters in positions 1 and 2 match "55" and are consumed. State of string: "566556".
- Characters 3 and 4 (maintaining original indices), "56", are not a match.
- Characters 4 and 5, "66", are not a match.
- Characters 5 and 6, "65", are not a match.
- Match 2: Characters in positions 6 and 7 match "55" and are consumed.
- Character 8, "6", is not a match.
- No more matches.
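The walkthrough above can be sketched as a small loop. This is only an illustration of the consuming behaviour for a fixed-length literal pattern, not how a real regex engine is implemented; find_consuming is a hypothetical helper name introduced here.

```r
string <- "55566556"
pattern <- "55"

# Hypothetical helper: scan left to right and, on a match,
# jump past the matched characters so they cannot be reused.
find_consuming <- function(string, pattern) {
  starts <- integer(0)
  n <- nchar(pattern)
  i <- 1
  while (i + n - 1 <= nchar(string)) {
    if (substr(string, i, i + n - 1) == pattern) {
      starts <- c(starts, i)
      i <- i + n  # consume the matched characters
    } else {
      i <- i + 1  # no match here, move one character along
    }
  }
  starts
}

consumed_starts <- find_consuming(string, pattern)

# Base R's own non-overlapping search, for comparison
builtin_starts <- as.integer(gregexpr(pattern, string, fixed = TRUE)[[1]])
```

Both approaches should report matches starting at positions 1 and 6 only: the "55" starting at position 2 is skipped because its first character was already consumed by Match 1.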
Using a pattern which does not consume the input (zero-length assertion)
To resolve this issue, you need to use a pattern which does not consume the input string when it finds a match:
There are several zero-length assertions (e.g. ^ (start of input/line), $ (end of input/line), \b (word boundary)), but look-arounds ((?<=) positive look-behind and (?=) positive look-ahead) are the only way to capture overlapping text from the input. Negative look-arounds ((?<!) negative look-behind, (?!) negative look-ahead) are not very useful here: if they assert true, then the capture inside failed; if they assert false, then the match fails. These assertions are zero-length (as mentioned before), which means that they assert without consuming the characters in the input string. If the assertion passes, they match the empty string.
However, you will see slightly strange output if you apply a lookahead pattern directly:
library(stringr)
string <- "55566556"
pattern <- "55"
lookahead_pattern <- paste0("(?=(", pattern, "))") # (?=(55))
str_locate_all(string, lookahead_pattern)
# [[1]]
#      start end
# [1,]     1   0
# [2,]     2   1
# [3,]     6   5
As you can see, the start positions are correct but the end positions are not. That is because the look-ahead is a zero-length match, which is what lets it avoid consuming the string, so each reported end is one less than its start.
In this case we know the length of the match is 2 characters. However, we do not always know the length from the input (e.g. in variable-length matches such as "5.+"). One way around this is to get the matching text using stringi:
stringi::stri_match_all_regex(string, lookahead_pattern)
# [[1]]
#      [,1] [,2]
# [1,] ""   "55"
# [2,] ""   "55"
# [3,] ""   "55"
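To illustrate why the captured text matters for variable-length patterns, here is a minimal sketch that wraps "5.+" in the same kind of look-ahead. The names variable_pattern and captured are introduced here for illustration only.

```r
library(stringi)

string <- "55566556"
# Look-ahead wrapping the variable-length pattern "5.+"
variable_pattern <- "(?=(5.+))"

# Column 2 holds capture group 1, i.e. the text seen by the look-ahead
captured <- stri_match_all_regex(string, variable_pattern)[[1]][, 2]

# The match length now has to be read off the captured text
captured_lengths <- nchar(captured)
```

Because ".+" is greedy, each capture runs to the end of the string, so the captured lengths differ from match to match (8, 7, 6, 3 and 2 here). The end positions therefore cannot be derived from the pattern alone.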
Putting it together to get your desired output
I am going to use stringi::stri_locate_all_regex, rather than stringr::str_locate_all, which is a wrapper for it:
library(stringi)
string <- paste0(c(5, 5, 5, 6, 6, 5, 5, 6), collapse = "")
pattern <- paste0(c(5, 5), collapse = "")
lookahead_pattern <- paste0("(?=(", pattern, "))")
# Start positions of the zero-length look-ahead matches
match_starts <- stri_locate_all_regex(
    string,
    lookahead_pattern
)[[1]]
# Text captured by group 1: "55" "55" "55"
match_text <- stri_match_all_regex(string, lookahead_pattern)[[1]][, 2]
# End position = start + length of captured text - 1
match_end <- match_starts[, "start"] + nchar(match_text) - 1
match_indices <- data.frame(
    start = match_starts[, "start"],
    end = match_end
)
match_indices
#   start end
# 1     1   2
# 2     2   3
# 3     6   7
Incidentally, you can also do this all in base R, using the approach here.
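For reference, one possible base R sketch is below. It is not necessarily the same as the linked approach; it assumes gregexpr with perl = TRUE and reads the position and length of the capture group from the "capture.start" and "capture.length" attributes of the result.

```r
string <- "55566556"
pattern <- "55"
lookahead_pattern <- paste0("(?=(", pattern, "))")

# perl = TRUE is needed for look-ahead support in base R regex functions
m <- gregexpr(lookahead_pattern, string, perl = TRUE)[[1]]

# Position and length of capture group 1 for each zero-length match
starts <- as.vector(attr(m, "capture.start"))
lens <- as.vector(attr(m, "capture.length"))

match_indices <- data.frame(start = starts, end = starts + lens - 1)
match_indices
```

This should give the same start and end positions as the stringi version, without any package dependencies.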