Objective:
Find all positions (start and end index) of a pattern in a string with overlapping allowed.
Approach:
The stri_locate_all_* functions from the stringi package return a list with one matrix per input string. Each matrix holds the start index and end index of every match. This is convenient for my purposes.
For a fixed pattern, the following works well:
library(stringi)
s <- "---"
pattern <- "--"
stri_locate_all_fixed(s, pattern, overlap = TRUE)
[[1]]
start end
[1,] 1 2
[2,] 2 3
Two occurrences of the pattern "--" exist in the string s. The first starts at index 1 and ends at index 2; the second starts at index 2 and ends at index 3.
--   (first match, indices 1 to 2)
 --  (second match, indices 2 to 3)
---  (the string s)
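As a quick sanity check on those positions, here is a minimal sketch that feeds the located indices back into stri_sub to pull out the matched substrings. It assumes the stringi package is installed; the names loc and matched are introduced here purely for illustration.

```r
library(stringi)

s <- "---"

# Locate overlapping occurrences of the fixed pattern "--".
loc <- stri_locate_all_fixed(s, "--", overlap = TRUE)[[1]]

# Extract the substring at each (start, end) pair to confirm the positions.
matched <- stri_sub(s, from = loc[, "start"], to = loc[, "end"])
print(matched)
```

Both extracted substrings should be "--", one for each overlapping position.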
However, in my case, the pattern may consist of multiple allowable characters (in any order or combination) and the length of the pattern may change. Therefore, "regex" seems more appropriate than "fixed".
Consider a pattern of length two, consisting of any combination of "-" and "1" (i.e., "-1", "1-", "--", "11"), used with stri_locate_all_regex.
pattern <- "[1-]{2}"  # inside [...], "|" would be a literal character, not alternation
s <- "-1-"
stri_locate_all_regex(s, pattern)
[[1]]
start end
[1,] 1 2
Note that stri_locate_all_regex does not support the overlap option, so the pattern itself must be adjusted if I want to capture overlaps.
According to various sources, I need to add a positive lookahead to my regex.
pattern <- "(?=[1-]{2})"
This pattern should (and does when tested on the regex101 tester) find the overlapping occurrences of the pattern.
However, when I use stri_locate_all_regex, the returned value is not what I'm looking for.
stri_locate_all_regex("---", "(?=[1-]{2})")
[[1]]
start end
[1,] 1 0
[2,] 2 1
Here, the function correctly identified that two matches exist and noted the start indices, but the end indices are lower than the start indices.
The stringi documentation states:
"For stri_locate_*_regex, if the match is of length 0, end will be one character less than start."
This suggests the matches are length 0; this observation is further supported by this description of regex "lookarounds":
"Lookahead and lookbehind, collectively called “lookaround”, are zero-length assertions...that lookaround actually matches characters, but then gives up the match, returning only the result: match or no match."
So, my issue seems to lie in the positive lookahead assertion, which appears to return a zero-length match at the "start" index.
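One possible workaround, sketched here under the assumption that every match has the same known width: keep the lookahead to obtain the start indices, then compute each end index as start + width - 1. The helper name locate_overlapping and its arguments are hypothetical and not part of stringi. The character class is written as "[1-]" because "|" inside brackets is a literal character.

```r
library(stringi)

# Hypothetical helper (not a stringi function): locate overlapping matches of
# a fixed-width pattern. The lookahead yields zero-length matches whose start
# indices are correct; the end indices are then derived from the known width.
locate_overlapping <- function(s, char_class, width) {
  pattern <- sprintf("(?=%s{%d})", char_class, width)
  # omit_no_match = TRUE gives a zero-row matrix instead of NA when nothing matches.
  loc <- stri_locate_all_regex(s, pattern, omit_no_match = TRUE)[[1]]
  start <- loc[, "start"]
  cbind(start = start, end = start + width - 1L)
}

res <- locate_overlapping("-1-", "[1-]", 2L)
print(res)
```

For the string "-1-" this should report two matches, at indices 1 to 2 and 2 to 3. The sketch does not cover patterns whose matches vary in length, since the end index is derived from a single width.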
My Distilled Questions:
-Is there a better regex approach for capturing overlapping (non-zero-length) matches? or,
-Is there a better R function than stri_locate_all_regex to achieve the desired output (a list of all start/end positions of pattern matches in a string)?
Thanks!