0

Objective:

Find all positions (start and end index) of a pattern in a string with overlapping allowed.

Approach:

The stri_locate_all_* functions return a list of positions of a pattern in a string. The list includes matrices containing the start index and end index for each match's position. This is convenient for my purposes.

For a fixed pattern, the following works well:

s <- "---"
pattern <- "--"
stri_locate_all_fixed(s, pattern, overlap = TRUE)
[[1]]
    start   end
[1,]    1   2
[1,]    2   3

Two occurrences of the pattern "--" exist in string "s". The first starts at index 1 of s and ends at index 2; and the second starts at index 2 and ends at index 3.

---
---

However, in my case, the pattern may consist of multiple allowable characters (in any order or combination) and the length of the pattern may change. Therefore, "regex" seems more appropriate than "fixed".

Consider a pattern length of two, consisting of any combination of "-" and "1" (i.e, "-1", "1-", "--", "11") and the use of stri_locate_all_regex.

pattern <- "[1|-]{2}"
s <- "-1-"    
stri_locate_all_regex(s, pattern)
[[1]]
    start   end
[1,]    1   2

Note that stri_locate_all_regex does not use the overlap attribute, so the pattern must be adjusted if I want to capture overlaps.

According to various sources, I need to add a positive lookahead to my regex.

pattern <- "(?=[1|-]{2})"

This pattern should (and does when tested on the regex101 tester) find the overlapping occurrences of the pattern.

However, when using the stri_locate_all_regex the returned value is not what I'm looking for.

stri_locate_all_regex("---", "(?=[1|-]{2})")
[[1]]
     start end
[1,]     1   0
[2,]     2   1

Here, the function correctly identified that two matches exist and noted the start indices, but the end indices are lower than the start indices.

The Stringi documentation states:

"For stri_locate_*_regex, if the match is of length 0, end will be one character less than start."

This suggests the matches are length 0; this observation is further supported by this description of regex "lookarounds":

"Lookahead and lookbehind, collectively called “lookaround”, are zero-length assertions...that lookaround actually matches characters, but then gives up the match, returning only the result: match or no match."

So, my issue seems to lay in the use of the positive lookahead assertion that appears to return a zero-length position at the "start" index.

My Distilled Questions:

-Is there a better regexp method for capturing overlapping (non-zero-length) matches? or,

-Is there a better r function than stri_locate_all_regex to achieve the desired output (a list of all start/end positions of pattern matches in a string)

Thanks!

AWaddington
  • 725
  • 8
  • 18

2 Answers2

2

You can use gregexpr and a PCRE regex with a capturing group enclosing the entire positive lookahead pattern:

pattern <- "(?=([1-]{2}))"
s <- "-1-"
res <- gregexpr(pattern, s, perl=TRUE)
starts <- attr(res[[1]],'capture.start') 
lengths <- attr(res[[1]],'capture.length')
ends <- starts + lengths - 1
df_positions <- do.call(rbind, Map(data.frame, start=starts, end=ends, length=lengths))
df_positions

Output:

  start end length
1     1   2      2
2     2   3      2

See an R demo

Wiktor Stribiżew
  • 607,720
  • 39
  • 448
  • 563
0

You could repeat the search using a lookbehind and then combine the two results. Inside a function it doesn't mess up your code but is probably somewhat inefficient:

library(stringi)

stri_locate_overlap <- function(str, pattern) {
  s <- stri_locate_all_regex(str, paste0("(?=", pattern, ")")) # match start, length 0
  e <- stri_locate_all_regex(str,  paste0("(?<=", pattern, ")")) # match end, length 0
  # combine two results
  mapply(function(x, y) {
    data.frame(start = x[, 1], 
               end = y[, 1])
  }, x = s, y = e, SIMPLIFY = FALSE)
}

stri_locate_overlap(c("---", "-1-"), "[1|-]{2}")
#> [[1]]
#>   start end
#> 1     1   3
#> 2     2   4
#> 
#> [[2]]
#>   start end
#> 1     1   3
#> 2     2   4
JBGruber
  • 11,727
  • 1
  • 23
  • 45
  • Interesting thought for a workaround. I'm curious why the end position is outside the index range (and lengths are three not two). I think the lookbehind iterates past the last character to find the end of the string, then works backward from there so you would have to subtract one from the end value. – AWaddington Aug 30 '20 at 22:49