locate all overlapping patterns in string

Question

I found the function str_locate_all:

library(stringr)
string = paste0(c(5,5,5,6,6,5,5,6), collapse = "")
pattern = paste0(c(5,5), collapse = "")

 str_locate_all(string, pattern)
[[1]]
     start end
[1,]     1   2
[2,]     6   7

Here I search for the pattern '55' (consecutive characters only) in the string '55566556'. The function reports only two occurrences, but '55' also occurs at positions 2 to 3.

How can I get this function to output the following?

> str_locate_all(string, pattern)
[[1]]
     start end
[1,]     1   2
[2,]     2   3
[3,]     6   7

Does this answer your question? [Overlapping matches in R](https://stackoverflow.com/questions/25800042/overlapping-matches-in-r) — SamR, Dec 12 '22 at 06:18
hi i saw this question but i dont know how to apply this logic to my question. can u pls suggest how to do? thkx! — stats_noob, Dec 12 '22 at 06:20

SamR · Accepted Answer · 2022-12-12T07:48:38.837

Regex matches consume characters

To elaborate on my comment, see this Python answer:

Except for zero-length assertions, characters in the input will always be consumed in the matching. If you are ever in the case where you want to capture a certain character in the input string more than once, you will need a zero-length assertion in the regex.

What happens in your case

We can step through a (simplified) version of regex matching your string "55566556", with your pattern, "55":

Match 1: Characters in positions 1 and 2 match "55" and are consumed. Remaining string: "566556".
Characters 3 and 4 (maintaining original indices), "56", are not a match.
Characters 4 and 5, "66", are not a match.
Characters 5 and 6, "65", are not a match.
Match 2: Characters in positions 6 and 7 match "55" and are consumed.
Character 8, "6", is not a match.
No more matches.
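
The walk-through above can be reproduced with a small hand-written scanner. This is only a sketch for fixed (literal) patterns, not how a real regex engine is implemented; `scan_fixed` and its `consume` argument are names made up for this illustration.

```r
# Hypothetical helper: scan a string left to right for a literal pattern.
# consume = TRUE  skips past each match (like ordinary regex matching).
# consume = FALSE moves on by one character, so overlaps are found.
scan_fixed <- function(string, pattern, consume = TRUE) {
    n <- nchar(pattern)
    last_start <- nchar(string) - n + 1L
    starts <- integer(0)
    i <- 1L
    while (i <= last_start) {
        if (substr(string, i, i + n - 1L) == pattern) {
            starts <- c(starts, i)
            i <- if (consume) i + n else i + 1L
        } else {
            i <- i + 1L
        }
    }
    starts
}

scan_fixed("55566556", "55")                  # 1 6
scan_fixed("55566556", "55", consume = FALSE) # 1 2 6
```

With `consume = TRUE` the second "55" (positions 2 and 3) is never tested, because position 2 was already used by the first match.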

Using a pattern which does not consume the input (zero-length assertion)

To resolve this issue, you need to use a pattern which does not consume the input string when it finds a match:

There are several zero-length assertions (e.g. ^ (start of input/line), $ (end of input/line), \b (word boundary)), but look-arounds ((?<=) positive look-behind and (?=) positive look-ahead) are the only way that you can capture overlapping text from the input. Negative look-arounds ((?<!) negative look-behind, (?!) negative look-ahead) are not very useful here: if they assert true, then the capture inside failed; if they assert false, then the match fails. These assertions are zero-length (as mentioned before), which means that they will assert without consuming the characters in the input string. They will actually match the empty string if the assertion passes.

However, you will see slightly strange output if you apply a lookahead pattern directly:

lookahead_pattern  <- paste0("(?=(", pattern, "))") # (?=(55))
str_locate_all(string, lookahead_pattern)
# [[1]]
#      start end
# [1,]     1   0
# [2,]     2   1
# [3,]     6   5

As you can see, the start positions are correct but the end positions are not. That is because the lookahead is a zero-length match, which is what keeps it from consuming the string, so each reported end is one less than its start.

In this case we know the length of the match is 2 characters. However, we do not always know the length from the input (e.g. in variable length matches such as "5.+"). One way around this is to get the matching text using stringi:

stringi::stri_match_all_regex(string, lookahead_pattern)
# [[1]]
#      [,1] [,2]
# [1,] ""   "55"
# [2,] ""   "55"
# [3,] ""   "55"
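
To illustrate the variable-length case mentioned above, here is a small sketch that wraps "5.+" in the same kind of lookahead. The variable names are only for illustration. Because `.+` is greedy, each captured match runs to the end of the string, so the match lengths differ and cannot be hard-coded.

```r
library(stringi)

string <- "55566556"
variable_pattern <- "(?=(5.+))"  # a "5" followed by one or more characters

# Start positions come from the zero-length lookahead matches
var_starts <- stri_locate_all_regex(string, variable_pattern)[[1]][, "start"]

# The text of each match comes from the capture group (second column)
var_text <- stri_match_all_regex(string, variable_pattern)[[1]][, 2]

# End positions are derived from the length of the captured text
var_ends <- var_starts + nchar(var_text) - 1L

data.frame(start = var_starts, end = var_ends, text = var_text)
```

The matches should start at positions 1, 2, 3, 6 and 7, and all end at position 8.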

Putting it together to get your desired output

I am going to use stringi::stri_locate_all_regex, rather than stringr::str_locate_all, which is a wrapper for it:

library(stringi)

string <- paste0(c(5, 5, 5, 6, 6, 5, 5, 6), collapse = "")
pattern <- paste0(c(5, 5), collapse = "")
lookahead_pattern <- paste0("(?=(", pattern, "))")


match_starts <- stri_locate_all_regex(
    string,
    lookahead_pattern
)[[1]]

# "55" "55" "55"
match_text  <- stri_match_all_regex(string, lookahead_pattern)[[1]][,2]

match_end   <- match_starts[,"start"] + nchar(match_text) - 1

match_indices  <- data.frame(
    start = match_starts[,"start"],
    end = match_end
)

match_indices
#   start end
# 1     1   2
# 2     2   3
# 3     6   7

Incidentally, you can also do this all in base R, using the approach here.
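
A minimal base R sketch of the same idea, assuming the linked approach relies on gregexpr with perl = TRUE (the link target is not reproduced here, so it may differ in detail). With perl = TRUE, the result carries "capture.start" and "capture.length" attributes for the capture group inside the lookahead.

```r
string <- "55566556"
lookahead_pattern <- "(?=(55))"

# perl = TRUE is needed for lookahead support and for the capture attributes
m <- gregexpr(lookahead_pattern, string, perl = TRUE)[[1]]

# One row per match, one column per capture group; we only have one group
base_starts <- as.integer(attr(m, "capture.start")[, 1])
base_lengths <- as.integer(attr(m, "capture.length")[, 1])

base_indices <- data.frame(
    start = base_starts,
    end = base_starts + base_lengths - 1L
)

base_indices
# start: 1 2 6
# end:   2 3 7
```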
