lookahead in R to identify a pattern of words in order

Question

I am trying to parse a set of strings. I need to find out whether 'bcl-2' was detected in the sample. For example: "bcl-2 was detected in 45% of patients". However, there are certain possible variations which are challenging:

1. "bcl-2 was detected in 45% bcl-6 was not detected"
2. "bcl-2 was not detected bcl-6 was detected in 45%"
3. "no evidence of bcl-2 bcl-6 was detected in 45%"

So I am trying to define the regex code that would:

1. Lookahead for 'bcl-2'
2. Then, lookahead from that point for 'detected'
3. Then lookbehind between 'bcl-2' and 'detected' to make sure there is no 'not'.
4. If possible, look behind 'bcl-2' to make sure it is not preceded by 'no evidence of' (though I can take care of this condition separately)

I tried the following code, which doesn't work. Specifically, it doesn't look behind, so I am guessing there is something inherent to lookbehind that I am missing.

This regex works for "bcl-2 was not detected" but fails for "bcl-2 was detected in 45% bcl-6 was not detected":

y <- "bcl-2 was detected in 45% bcl-6 was not detected"
grepl("(?=bcl-?2)(?!.*not)(?=.*detected)", y, ignore.case = TRUE, perl = TRUE)

So I thought this would work, but it doesn't:

grepl("(?=bcl-?2)(?=.*detected)(?<!not)", y, ignore.case = TRUE, perl = TRUE)

I am trying to understand the logic of lookbehind. Regarding the last line of code: I thought (?=bcl-?2) looks forward to the point in the string where 'bcl-2' begins. Then I thought (?=.*detected) looks forward to the position in the string where 'detected' starts, and that the lookbehind then starts looking backwards from that position for 'not'. This is of course wrong, so what am I missing about the lookaround logic?

BTW, a great website I have been using in an attempt to figure this out: https://www.regular-expressions.info/recurse.html

The only positive option you have mentioned is one containing the string `"bcl-2 was detected"`, so why not just search for that? — Andrew Gustar, Oct 07 '17 at 17:14
Try [`\bbcl-2\b(?:(?!\bbcl-\d|\bnot\b).)*?\bdetected\b`](https://regex101.com/r/g5V81L/1). See the [R demo online](https://ideone.com/mxycKT). — Wiktor Stribiżew, Oct 07 '17 at 17:46

score 2 · Accepted Answer · answered Oct 07 '17 at 17:56


Lookarounds are zero-width assertions: the regex index does not move when their patterns are matched. The characters they check are not added to the match value, and consecutive lookarounds all start their pattern checks from the same location. So, (?=bcl-?2)(?!.*not)(?=.*detected) matches an empty location (empty string) that is followed by bcl2 or bcl-2, that has no not substring after any 0+ chars other than line break chars, and that is followed by detected after any 0+ chars other than line break chars. Because there are no anchors, this pattern is tried at every location in the input string. Likewise, in your second attempt, (?<!not) is checked at that same location, right before bcl-2, and not at the position where detected starts. That is hardly what you need.
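As a minimal sketch of the zero-width behaviour described above, both attempts from the question can be run in R. The strings and patterns are the ones from the question; the variable names `y1`, `y2`, `first`, `second` and `m` are only illustrative.

```r
y1 <- "bcl-2 was detected in 45% bcl-6 was not detected"
y2 <- "bcl-2 was not detected bcl-6 was detected in 45%"

# First attempt: all three lookaheads are checked from the same position.
# "not" occurs somewhere after "bcl-2" in y1, so (?!.*not) rejects it.
first <- grepl("(?=bcl-?2)(?!.*not)(?=.*detected)", y1,
               ignore.case = TRUE, perl = TRUE)
first
## => [1] FALSE

# Second attempt: (?<!not) only inspects the 3 chars right before the
# current position (the start of "bcl-2"), so it cannot reject y2.
second <- grepl("(?=bcl-?2)(?=.*detected)(?<!not)", y2,
                ignore.case = TRUE, perl = TRUE)
second
## => [1] TRUE

# The match itself is empty: it starts at position 1 and has length 0.
m <- regexpr("(?=bcl-?2)(?=.*detected)(?<!not)", y2,
             ignore.case = TRUE, perl = TRUE)
as.integer(m)
## => [1] 1
attr(m, "match.length")
## => [1] 0
```

The zero match length shows that nothing was consumed: every lookaround was evaluated at the position right before bcl-2.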

Here is a possible solution:

\bbcl-2\b(?:(?!\bbcl-\d|\bnot\b).)*?\bdetected\b

See the regex demo. Details:

\b - a word boundary
bcl-2 - a bcl-2 substring
\b - a word boundary
(?:(?!\bbcl-\d|\bnot\b).)*? - (a tempered greedy token) any 0+ (but as few as possible) chars other than line break chars, none of which starts either of the following two sequences:
- \bbcl-\d - a word boundary followed by bcl- and a digit
- | - or
- \bnot\b - a whole word not
\bdetected\b - a whole word detected

See an R demo below:

x <- c("bcl-2 was detected in 45% bcl-6 was not detected",
       "bcl-2 was not detected bcl-6 was detected in 45%",
       "no evidence of bcl-2 bcl-6 was detected in 45%")
grep("\\bbcl-2\\b(?:(?!\\bbcl-\\d|\\bnot\\b).)*?\\bdetected\\b", x, perl=TRUE, value=TRUE)
## => [1] "bcl-2 was detected in 45% bcl-6 was not detected"
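To see what the tempered greedy token actually consumes, here is a small sketch along the same lines that extracts the matched substring with `regexpr` and `regmatches`. The names `pattern`, `hits` and `matched` are illustrative and not part of the demo above.

```r
x <- c("bcl-2 was detected in 45% bcl-6 was not detected",
       "bcl-2 was not detected bcl-6 was detected in 45%",
       "no evidence of bcl-2 bcl-6 was detected in 45%")
pattern <- "\\bbcl-2\\b(?:(?!\\bbcl-\\d|\\bnot\\b).)*?\\bdetected\\b"

# One logical value per input string
hits <- grepl(pattern, x, perl = TRUE)
hits
## => [1]  TRUE FALSE FALSE

# regmatches() keeps only the strings that matched and returns the
# matched substring, which stops at the first "detected"
matched <- regmatches(x, regexpr(pattern, x, perl = TRUE))
matched
## => [1] "bcl-2 was detected"
```

In the second string the token stops at the word not, and in the third it stops at bcl-6, so neither can reach detected.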


Wiktor Stribiżew


Thank you! Works great! Is there an alternative way to do this with something that moves the regex position in the string to the end of the matched lookahead pattern? I initially thought that atomic patterns (?>string) or the \\Z notation at the end of the lookahead pattern should work, but they didn't. From your response I realize that probably neither changes the position in the string. – user2387584 Oct 08 '17 at 11:38
@user2387584 I do not understand what you are asking for. When you use consuming patterns (not inside lookarounds), the regex index is advanced automatically. That is just how the regex engine parses the string, from left to right. You are interested in checking if there is a partial match in any of the character vectors. Right? If not, please explain with an example. – Wiktor Stribiżew Oct 08 '17 at 13:46
@Wiktor Stribiżew can't thank you enough for your help here and also in many other questions other users have asked! I wish I could UP-100 – user2387584 Oct 14 '17 at 16:44
For my second question, I found an answer in your comments to another user's post [link](https://stackoverflow.com/questions/44855076/negative-lookbehind-in-r-with-multi-word-separation) where you wrote: your "negative lookahead is having problems" because it is not a lookahead, it is a lookbehind that cannot have a pattern of unknown length. Looks like you can just use a lookahead this way - `"^(?!.*.*).*"`. For example, to look for 'cat' never preceded by 'dog', use `"^(?!.*dog.*cat).*cat"`. Notice the ^ indicating to search from the start of the string. – user2387584 Oct 14 '17 at 17:00
@ Wiktor Stribiżew - do you have any idea why this code `grepl("(?=.*?rearrangement|.*?translocation|.*?fusion|.*?\\S[;:]\\S)(bcl-?2|14[;:]18)(?:(?!\\bnot\\b|cannot|n't|, ).)*?(reveal(ed)?|see(n)?|detect(ed)?|demonstrate(d)?)", y, perl=TRUE,ignore.case = T)` returns a FALSE for the following string `y= "t(14:18) was detected"` ? Each part of the grep (i.e. "(?=.*?rearrangement|.*?translocation|.*?fusion|.*?\\S[;:]\\S)" and "(bcl-?2|14[;:]18)(?:(?!\\bnot\\b|cannot|n't|, ).)*?(reveal(ed)?|see(n)?|detect(ed)?|demonstrate(d)?)") is evaluated as TRUE but together as FALSE – user2387584 Oct 14 '17 at 17:06
I have tried it and see [no reason why it should return false](https://regex101.com/r/1g5XTP/1). I will check again when I am not sleepy. – Wiktor Stribiżew Oct 14 '17 at 21:03
Thank you. I have managed to get it to work by adding `.*?` before the second capturing group, though I have no idea why this is necessary. `grepl("(?=.*?rearrangement|.*?translocation|.*?fusion|.*?\\S[;:]\\S).*?(bcl-?2|14[;:]18)(?:(?!\\bnot\\b|cannot|n't|, ).)*?(reveal(ed)?|see(n)?|detect(ed)?|demonstrate(d)?)", y, perl=TRUE,ignore.case = T)` – user2387584 Oct 15 '17 at 00:03
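The anchored negative lookahead mentioned in the comments above (`"^(?!.*dog.*cat).*cat"`) can be tried out with a short sketch. The test strings and the names `pets`, `anchored` and `res` are made up for illustration.

```r
# Hypothetical test strings, not taken from the thread
pets <- c("a cat and a dog", "a dog and a cat", "no pets here")

# ^ pins the check to the start of the string; the lookahead then
# rejects any string in which "dog" is followed later by "cat"
anchored <- "^(?!.*dog.*cat).*cat"
res <- grepl(anchored, pets, perl = TRUE)
res
## => [1]  TRUE FALSE FALSE
```

Because the lookahead is anchored at the start, a single dog followed by a cat anywhere in the string rejects the whole string, even if an earlier cat is not preceded by dog.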
