14

I searched and found this forum discussion on achieving the effect of overlapping matches.

I also found the following SO question that discusses finding indexes to perform this task, but I could not find anything concise about grabbing overlapping matches in the R language.

I can perform this task in almost any language that supports PCRE by using a positive lookahead assertion with a capturing group inside the lookahead to capture the overlapping matches.

But when I do this the same way I would in other languages, using perl=T in R, only empty strings are returned.

> x <- 'ACCACCACCAC'
> regmatches(x, gregexpr('(?=([AC]C))', x, perl=T))[[1]]
[1] "" "" "" "" "" "" ""

The same goes for both the stringi and stringr packages.

> library(stringi)
> library(stringr)
> stri_extract_all_regex(x, '(?=([AC]C))')[[1]]
[1] "" "" "" "" "" "" ""
> str_extract_all(x, perl('(?=([AC]C))'))[[1]]
[1] "" "" "" "" "" "" ""

The result that should be returned is:

[1] "AC" "CC" "AC" "CC" "AC" "CC" "AC"

Edit

  1. I am well aware that regmatches does not work well with captured matches, but what exactly causes this behavior in regmatches, and why are only empty strings returned? I am looking for a somewhat detailed answer.

  2. Are the stringi and stringr packages not capable of performing this either?

  3. Please feel free to add to my answer or come up with a different workaround than I have found.

hwnd

6 Answers

7

As far as a workaround, this is what I have come up with to extract the overlapping matches.

> x <- 'ACCACCACCAC'
> m <- gregexpr('(?=([AC]C))', x, perl=T)
> mapply(function(X) substr(x, X, X+1), m[[1]])
[1] "AC" "CC" "AC" "CC" "AC" "CC" "AC"

Please feel free to add or comment on a better way to perform this task.

hwnd
7

The standard regmatches does not work well with captured matches (specifically, multiple captured matches in the same string). In this case, since you're "matching" a lookahead (ignoring the capture), the match itself is zero-length. There is also a regmatches()<- function that may illustrate this. Observe:

x <- 'ACCACCACCAC'
m <- gregexpr('(?=([AC]C))', x, perl=T)
regmatches(x, m) <- "~"
x
# [1] "~A~CC~A~CC~A~CC~AC"

Notice how all the letters are preserved, we've just replaced the locations of the zero-length matches with something we can observe.
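
To make the zero-length point concrete, here is a small sketch that inspects the attributes attached to the gregexpr result. The variable names (positions, match_len, capture_len) are only illustrative.

```r
x <- 'ACCACCACCAC'
m <- gregexpr('(?=([AC]C))', x, perl = TRUE)[[1]]

positions   <- as.integer(m)                          # where each lookahead succeeds
match_len   <- attr(m, "match.length")                # width of the overall match
capture_len <- as.vector(attr(m, "capture.length"))   # width of the captured group

positions     # 1 2 4 5 7 8 10
match_len     # all 0: the lookahead consumes no characters
capture_len   # all 2: the group inside the lookahead did capture text
```

regmatches() only looks at the match start and "match.length", which is why it hands back empty strings even though the capture information is present.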

I've created a regcapturedmatches() function that I often use for such tasks. For example

x <- 'ACCACCACCAC'
regcapturedmatches(x, gregexpr('(?=([AC]C))', x, perl=T))[[1]]

#      [,1] [,2] [,3] [,4] [,5] [,6] [,7]
# [1,] "AC" "CC" "AC" "CC" "AC" "CC" "AC"

The gregexpr call is grabbing all the data just fine, so you can extract it from that object any way you like if you prefer not to use this helper function.
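
As a rough sketch of what pulling the captures out of that object by hand might look like in base R: the function below is not regcapturedmatches(). extract_first_capture is a hypothetical name, and it assumes the pattern has at least one capture group and only returns the first one.

```r
# Hypothetical helper: returns, for each input string, the text of the
# first capture group at every match position (overlaps included).
extract_first_capture <- function(x, pattern) {
  m <- gregexpr(pattern, x, perl = TRUE)
  lapply(seq_along(x), function(i) {
    mi <- m[[i]]
    if (mi[1] == -1L) return(character(0))        # no match in this string
    st  <- attr(mi, "capture.start")[, 1]         # start of capture group 1
    len <- attr(mi, "capture.length")[, 1]        # length of capture group 1
    substring(x[i], st, st + len - 1L)
  })
}

res <- extract_first_capture(c('ACCACCACCAC', 'XYZ'), '(?=([AC]C))')
res[[1]]
# [1] "AC" "CC" "AC" "CC" "AC" "CC" "AC"
```

Because it reads "capture.start" and "capture.length" rather than assuming a fixed width, the same approach should also cope with captures of varying length.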

MrFlick
  • +1 Interesting function you've created. I am well aware of zero-width matches, so basically regmatches and the other packages such as stringi and stringr are not meant to handle this? – hwnd Sep 12 '14 at 03:45
  • I can't speak to `stringr` as I've never used that myself, but `regmatches` really focuses on the match rather than the capture (which are highly related but slightly different). I've added an additional sample to try to make it clear what `regmatches()` is capturing compared to my function. – MrFlick Sep 12 '14 at 03:50
  • Yeah, I've used `regmatches()<-` like that beforehand to observe the effect of the zero-width matches. – hwnd Sep 12 '14 at 03:53
5

A stringi solution using a capture group in the look-ahead part:

> stri_match_all_regex('ACCACCACCAC', '(?=([AC]C))')[[1]][,2]
## [1] "AC" "CC" "AC" "CC" "AC" "CC" "AC"    
gagolews
4

Another roundabout way of extracting the same information that I've done in the past is to replace the "match.length" with the "capture.length":

x <- c("ACCACCACCAC","ACCACCACCAC")
m <- gregexpr('(?=([AC]C))', x, perl=TRUE)
m <- lapply(m, function(i) {
       attr(i,"match.length") <- attr(i,"capture.length")
       i
     })
regmatches(x,m)

#[[1]]
#[1] "AC" "CC" "AC" "CC" "AC" "CC" "AC"
#
#[[2]]
#[1] "AC" "CC" "AC" "CC" "AC" "CC" "AC"
thelatemail
  • +1 Thanks for the additional solution. I've done similar using `capture.start` and `capture.length`. – hwnd Sep 12 '14 at 05:28
4

It's not a regex solution, and doesn't really answer any of your more important questions, but you could also get your desired result by using the substrings of two characters at a time and then removing the unwanted CA elements.

x <- 'ACCACCACCAC'
y <- substring(x, 1:(nchar(x)-1), 2:nchar(x))
y[y != "CA"]
# [1] "AC" "CC" "AC" "CC" "AC" "CC" "AC"
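
A slight variation on the same idea, offered only as a sketch: instead of excluding "CA" by hand, each two-character window can be tested against the original character class. This assumes the pattern has a fixed width of two characters and that x has at least two characters; windows and result are illustrative names.

```r
x <- 'ACCACCACCAC'
n <- nchar(x)
windows <- substring(x, 1:(n - 1), 2:n)        # every 2-character window
result  <- windows[grepl('^[AC]C$', windows)]  # keep windows matching [AC]C
result
# [1] "AC" "CC" "AC" "CC" "AC" "CC" "AC"
```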
Rich Scriven
1

An additional answer, based on @hwnd's own answer (the original didn't allow variable-length captured regions), using just built-in R functions:

> x <- 'ACCACCACCAC'
> m <- gregexpr('(?=([AC]C))', x, perl=T)[[1]]
> start <- attr(m,"capture.start")
> end <- attr(m,"capture.start") + attr(m,"capture.length") - 1
> sapply(seq_along(m), function(i) substr(x, start[i], end[i]))
[1] "AC" "CC" "AC" "CC" "AC" "CC" "AC"

Pretty ugly, which is why the stringr etc. packages exist.

Ken Williams