Overlapping matches in R

Question

I searched and found this forum discussion on achieving the effect of overlapping matches.

I also found the following SO question, which talks about finding indexes to perform this task, but I could not find anything concise about grabbing overlapping matches in R.

I can perform this task in almost any language that supports PCRE by using a positive lookahead assertion with a capturing group inside the lookahead to capture the overlapping matches.

But when I do this in R the same way I would in other languages, using perl=T, I get only empty strings.

> x <- 'ACCACCACCAC'
> regmatches(x, gregexpr('(?=([AC]C))', x, perl=T))[[1]]
[1] "" "" "" "" "" "" ""

The same goes for both the stringi and stringr packages.

> library(stringi)
> library(stringr)
> stri_extract_all_regex(x, '(?=([AC]C))')[[1]]
[1] "" "" "" "" "" "" ""
> str_extract_all(x, perl('(?=([AC]C))'))[[1]]
[1] "" "" "" "" "" "" ""

The correct results that should be returned are:

[1] "AC" "CC" "AC" "CC" "AC" "CC" "AC"

Edit

I am well aware that regmatches does not work well with captured matches, but what exactly causes this behavior in regmatches, and why are only empty strings returned? I am looking for a somewhat detailed answer.
Are the stringi and stringr packages not capable of this either?
Please feel free to add to my answer or come up with a different workaround from the one I found.

hwnd · Answer 1 · 2014-09-12T03:54:45.190

7

As for a workaround, this is what I came up with to extract the overlapping matches.

> x <- 'ACCACCACCAC'
> m <- gregexpr('(?=([AC]C))', x, perl=T)
> mapply(function(X) substr(x, X, X+1), m[[1]])
[1] "AC" "CC" "AC" "CC" "AC" "CC" "AC"

Please feel free to add or comment on a better way to perform this task.

edited Sep 12 '14 at 03:54

answered Sep 12 '14 at 02:56

hwnd


The problem with this solution is that it only works when the captured region is always 2 characters long. A more general solution is this: – Ken Williams Aug 10 '15 at 14:46
Oops. I forgot I can't put code blocks in comments. Will make this a separate answer. – Ken Williams Aug 10 '15 at 14:48

MrFlick · Accepted Answer · 2014-09-12T03:50:02.063

7

The standard regmatches does not work well with captured matches (specifically, multiple captured matches in the same string). And in this case, since you're "matching" a lookahead (ignoring the capture), the match itself is zero-length. There is also a regmatches()<- function that may illustrate this. Observe:

x <- 'ACCACCACCAC'
m <- gregexpr('(?=([AC]C))', x, perl=T)
regmatches(x, m) <- "~"
x
# [1] "~A~CC~A~CC~A~CC~AC"

Notice how all the letters are preserved; we've just replaced the locations of the zero-length matches with something we can observe.
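
To make the zero-length point concrete, here is a minimal base R sketch that inspects what gregexpr stores for this pattern. The variable names (starts, match_len, cap_start, cap_len) are made up for illustration.

```r
x <- 'ACCACCACCAC'
m <- gregexpr('(?=([AC]C))', x, perl = TRUE)[[1]]

# Start position of every lookahead match
starts <- as.integer(m)

# Length of the overall match: the lookahead itself consumes nothing
match_len <- attr(m, "match.length")

# Start and length of capture group 1 (stored as one-column matrices)
cap_start <- attr(m, "capture.start")[, 1]
cap_len   <- attr(m, "capture.length")[, 1]

print(starts)
print(match_len)
print(cap_len)
```

The match lengths should all be 0 while the capture lengths are all 2, which is why regmatches hands back empty strings even though the captures were recorded.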

I've created a regcapturedmatches() function that I often use for such tasks. For example:

x <- 'ACCACCACCAC'
regcapturedmatches(x, gregexpr('(?=([AC]C))', x, perl=T))[[1]]

#      [,1] [,2] [,3] [,4] [,5] [,6] [,7]
# [1,] "AC" "CC" "AC" "CC" "AC" "CC" "AC"

The gregexpr call is grabbing all the data just fine, so you can extract it from that object any way you like if you prefer not to use this helper function.

edited Sep 12 '14 at 03:50

answered Sep 12 '14 at 03:37

MrFlick


+1 Interesting function you've created. I am well aware of zero-width matches, so basically regmatches and the other packages such as stringi and stringr are not meant to handle this? – hwnd Sep 12 '14 at 03:45
I can't speak to `stringr` as I've never used that myself, but `regmatches` really focuses on the match rather than the capture (which are highly related but slightly different). I've added an additional sample to try to make it clear what `regmatches()` is capturing compared to my function. – MrFlick Sep 12 '14 at 03:50
Yea, I've used `regmatches()<-` like that beforehand to observe the effect of the zero-width matches. – hwnd Sep 12 '14 at 03:53

gagolews · Answer 3 · 2019-03-24T20:16:51.753

5

A stringi solution using a capture group in the look-ahead part:

> stri_match_all_regex('ACCACCACCAC', '(?=([AC]C))')[[1]][,2]
## [1] "AC" "CC" "AC" "CC" "AC" "CC" "AC"

edited Mar 24 '19 at 20:16

answered Oct 26 '14 at 19:55

gagolews


Weird, how come it failed to work with `stri_extract_all_regex`? – hwnd Oct 26 '14 at 20:00
@hwnd: it's a 0-length match; `(?=...)` [does not](http://userguide.icu-project.org/strings/regexp) advance the input position. – gagolews Oct 26 '14 at 20:02
Yes I know it's a zero-width match =) I guess there is a difference between `extract_all_regex` and `match_all_regex` – hwnd Oct 26 '14 at 20:04
No, the 1st column of the resulting matrix (the whole match) consists only of empty strings :) – gagolews Oct 26 '14 at 20:05
Ok now I see and understand what you mean. – hwnd Oct 26 '14 at 20:06

thelatemail · Answer 4 · 2014-09-12T05:46:08.720

4

Another roundabout way of extracting the same information, which I've used in the past, is to replace the "match.length" with the "capture.length":

x <- c("ACCACCACCAC","ACCACCACCAC")
m <- gregexpr('(?=([AC]C))', x, perl=TRUE)
m <- lapply(m, function(i) {
       attr(i,"match.length") <- attr(i,"capture.length")
       i
     })
regmatches(x,m)

#[[1]]
#[1] "AC" "CC" "AC" "CC" "AC" "CC" "AC"
#
#[[2]]
#[1] "AC" "CC" "AC" "CC" "AC" "CC" "AC"

edited Sep 12 '14 at 05:46

answered Sep 12 '14 at 05:10

thelatemail


+1 Thanks for the additional solution. I've done similar using `capture.start` and `capture.length`. – hwnd Sep 12 '14 at 05:28

Rich Scriven · Answer 5 · 2015-08-10T15:52:39.150

4

It's not a regex solution, and it doesn't really answer any of your more important questions, but you could also get your desired result by taking substrings of two characters at a time and then removing the unwanted CA elements.

x <- 'ACCACCACCAC'
y <- substring(x, 1:(nchar(x)-1), 2:nchar(x))
y[y != "CA"]
# [1] "AC" "CC" "AC" "CC" "AC" "CC" "AC"
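
If the unwanted pieces are harder to enumerate than the wanted ones, the same idea can probably be flipped around: keep only the fixed-width windows that match the pattern. A minimal sketch, assuming every piece is exactly n characters long (n, starts and res are names made up here):

```r
x <- 'ACCACCACCAC'
n <- 2                                   # assumed fixed window width

# Every window of n consecutive characters
starts <- seq_len(nchar(x) - n + 1)
y <- substring(x, starts, starts + n - 1)

# Keep only the windows that match the original pattern in full
res <- y[grepl('^[AC]C$', y)]
print(res)
```

This still only covers fixed-width pieces, so it is no substitute for the capture-based approaches.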

edited Aug 10 '15 at 15:52

answered Sep 13 '14 at 02:54

Rich Scriven


score 1 · Answer 6 · answered Aug 10 '15 at 14:51

An additional answer, based on @hwnd's own answer (the original didn't allow variable-length captured regions), using just built-in R functions:

> x <- 'ACCACCACCAC'
> m <- gregexpr('(?=([AC]C))', x, perl=T)[[1]]
> start <- attr(m,"capture.start")
> end <- attr(m,"capture.start") + attr(m,"capture.length") - 1
> sapply(seq_along(m), function(i) substr(x, start[i], end[i]))
[1] "AC" "CC" "AC" "CC" "AC" "CC" "AC"

Pretty ugly, which is why the stringr etc. packages exist.
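
If you need this more than once, the steps above could be wrapped up roughly like this. It is only a sketch: extract_captures is a hypothetical helper name (not part of base R or any package), and it assumes the pattern has at least one capture group and only looks at the first one.

```r
# Hypothetical helper: for each string in x, return the text of capture
# group 1 for every match of `pattern` (PCRE, so lookaheads are allowed).
extract_captures <- function(x, pattern) {
  m <- gregexpr(pattern, x, perl = TRUE)
  lapply(seq_along(x), function(i) {
    mi <- m[[i]]
    if (mi[1] == -1) return(character(0))        # no match in this string
    start <- attr(mi, "capture.start")[, 1]      # first capture group only
    len   <- attr(mi, "capture.length")[, 1]
    substring(x[i], start, start + len - 1)
  })
}

# Fixed-width captures, as in the question
print(extract_captures('ACCACCACCAC', '(?=([AC]C))')[[1]])

# Variable-width, overlapping captures
print(extract_captures('ACCACCAC', '(?=(C+A))')[[1]])
```

Because the end position comes from capture.length rather than a hard-coded width, the second call can return pieces of different lengths that overlap each other.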
