Yet another post on negative lookbehind in regex in R but I can't find what I'm doing wrong here.

I have these strings:

test <- c("a %in% c('b', 'e')" , "case_when(a %in% c('b', 'e'))", "hello")

I want to detect which strings contain a %in% without being preceded by a case_when(. I can find which ones contain a case_when( and then a %in% with this regex:

grepl("(?=.*case\\_when\\()(.*%in%)", test, perl = TRUE)
#> [1] FALSE  TRUE FALSE

So I just need to negate this lookaround, and I thought replacing = with <! would be enough, but apparently not:

grepl("(?<!case\\_when\\()(.*%in%)", test, perl = TRUE)
#> [1]  TRUE  TRUE FALSE

The expected output is TRUE FALSE FALSE. What am I doing wrong?

bretauv

2 Answers

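.* is greedy: it matches as much as it can, including case_when(. In the second string, the match can start at the very beginning, so .*%in% matches case_when(a %in%. Since nothing precedes that starting position (and in particular no other case_when(), the negative lookbehind succeeds and the string counts as a match.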

case_when(a %in% c('b', 'e'))
^^^^^^^^^^^^
    '.*'
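
To check this, you can extract the matched text instead of only testing for a match. A minimal sketch with base R's regexpr() and regmatches(), reusing the test vector and the pattern from the question (the variable names pattern and matched are just for illustration):

```r
test <- c("a %in% c('b', 'e')", "case_when(a %in% c('b', 'e'))", "hello")

# Same pattern as in the question, with the negative lookbehind.
pattern <- "(?<!case\\_when\\()(.*%in%)"

# regexpr() locates the first match in each string; regmatches() extracts it.
# Strings without a match ("hello") are dropped from the result.
matched <- regmatches(test, regexpr(pattern, test, perl = TRUE))

# The match in the second string starts at position 1 and swallows "case_when(".
matched[2]
#> [1] "case_when(a %in%"
```

The lookbehind is only checked at the position where the match starts, which here is the start of the string.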

You can replace . with [^(] so that the match cannot cross an opening parenthesis, and use the (*SKIP)(*FAIL) idiom to discard the matches you don't want.

case_when\(      # Match 'case_when('
[^(]*%in%        # followed by 0+ non-opening-parenthesis characters and '%in%',
(*SKIP)(*FAIL)   # then skip and discard everything we just matched
|                # before matching
[^(]*%in%        # every other instance of `[^(]*%in%`.

Try it on regex101.com.

In R:

test <- c("a %in% c('b', 'e')", "case_when(a %in% c('b', 'e'))", "hello")
grepl("case_when\\([^\\(]*%in%(*SKIP)(*FAIL)|[^\\(]*%in%", test, perl = TRUE)
#> [1]  TRUE FALSE FALSE

Do note that parsing (presumably) R, a non-regular language, with regex is most likely not a good choice.
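
To illustrate that caveat, here is a rough sketch that uses R's own parser instead of a regex. It assumes each string is a single parseable R expression and that case_when is called by its bare name (not dplyr::case_when); the helper name has_in_outside_case_when is made up for this example.

```r
# Hypothetical helper: TRUE if the code contains a call to %in% that is not
# nested (at any depth) inside a call to case_when().
has_in_outside_case_when <- function(code) {
  walk <- function(expr, inside) {
    if (!is.call(expr)) return(FALSE)          # symbols and constants
    fn <- expr[[1]]
    if (is.symbol(fn)) {
      name <- as.character(fn)
      if (name == "%in%" && !inside) return(TRUE)
      if (name == "case_when") inside <- TRUE  # applies to everything nested below
    }
    # Recurse into the function position and all arguments.
    any(vapply(as.list(expr), walk, logical(1), inside = inside))
  }
  # Assumption: only the first expression in the string is inspected.
  walk(parse(text = code)[[1]], FALSE)
}

test <- c("a %in% c('b', 'e')", "case_when(a %in% c('b', 'e'))", "hello")
ast_result <- vapply(test, has_in_outside_case_when, logical(1), USE.NAMES = FALSE)
# ast_result is TRUE FALSE FALSE

# A nested call inside case_when() is enough to fool the regex above:
tricky <- "case_when(f(a) %in% b)"
regex_result <- grepl("case_when\\([^\\(]*%in%(*SKIP)(*FAIL)|[^\\(]*%in%", tricky, perl = TRUE)
ast_tricky <- has_in_outside_case_when(tricky)
# regex_result is TRUE (a false positive), ast_tricky is FALSE
```

This sketch does not handle calls with empty arguments (such as x[, 1]) or unparseable input, so treat it as a starting point rather than a drop-in replacement.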

InSync

Use a negative lookahead, (?!...), anchored at the start of the string (^): the string must not start with "case_when" and must contain "%in%":

grepl("^(?!case_when).*%in%.*", test, perl = TRUE)

Or you could break it into two checks (the string does not contain "case_when" and does contain "%in%"):

!grepl("case_when", test, fixed = TRUE) & grepl("%in%", test, fixed = TRUE)
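
The two approaches differ when "case_when" is not at the very start of the string. A small sketch comparing them; the fourth string and the variable names are hypothetical additions for illustration, and the middle pattern is one possible variant that scans the whole string for "case_when(":

```r
test2 <- c("a %in% c('b', 'e')",
           "case_when(a %in% c('b', 'e'))",
           "hello",
           "x <- case_when(a %in% c('b', 'e'))")  # hypothetical extra case

# Anchored lookahead: only rejects strings that *start* with "case_when".
starts_check <- grepl("^(?!case_when).*%in%", test2, perl = TRUE)
# TRUE FALSE FALSE TRUE

# Lookahead that rejects strings containing "case_when(" anywhere.
anywhere_check <- grepl("^(?!.*case_when\\().*%in%", test2, perl = TRUE)
# TRUE FALSE FALSE FALSE

# Two fixed-string checks, no regex involved.
two_step <- !grepl("case_when", test2, fixed = TRUE) &
  grepl("%in%", test2, fixed = TRUE)
# TRUE FALSE FALSE FALSE
```

Note that the last two reject any string that contains "case_when" at all, even if it comes after the %in%, which is stricter than "not preceded by".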
LMc