I lack experience with regex and need help with this problem:

Objective: Create a regex pattern that matches "renal" or "kidney" in an arbitrary string, but only if that string does not also contain "carcinoma".

Example strings:

  1. "Renal cell carcinoma"
  2. "Clear cell carcinoma of kidney"
  3. "Chronic renal impairment"

Expected output: The regex pattern does not match "renal" and "kidney" in the first two strings; it does match "renal" in the third string (since there is no "carcinoma").

What I've tried: `(?<!carcinoma).*(kidney|renal)`. I stopped here because it didn't work. As I've learned here and here, lookbehinds in many regex engines are limited to fixed-length patterns, so a lookbehind cannot scan backwards over an arbitrary length.

So what regex pattern will do the trick? I want a pattern that maintains focus on (or is "anchored" on) "renal" and "kidney" and not "carcinoma".

Wiktor Stribiżew
billiam
  • You could use a negative lookahead from the start of the string `^(?!.*\bcarcinoma\b).*\b(renal|kidney)\b.*` https://regex101.com/r/1FVdio/1 – The fourth bird Jul 03 '20 at 18:55
  • That works, thank you very much! Let me make sure I understand it: it anchors at the start of the string (with ^); does a nonzero-length, negative lookahead for carcinoma, followed by the inclusion of renal or kidney. – billiam Jul 03 '20 at 19:07
  • Yes, if by "inclusion" you mean it matches "renal" or "kidney" preceded (from the start of the string) by zero or more characters. Without the anchor, for the string `"my kidney is great"` the negative lookahead would be satisfied, after which the regex's pointer would be between the `"k"` and `"i"` of `"kidney"`, because from that location `"kidney"` does not appear in the rest of the string. Hence the need for the anchor. – Cary Swoveland Jul 03 '20 at 20:08
  • In general, to assert the string *does not* contain a specific string, as here, use a negative lookahead with the anchor `^` (or possibly `\A`, depending on the language and application). To assert the string *does* contain a specific string, use a positive lookahead with the same anchor. – Cary Swoveland Jul 03 '20 at 20:15

1 Answer

The pattern that you tried, `(?<!carcinoma).*(kidney|renal)`, only asserts that what is directly to the left of the current position is not carcinoma, which is always true at the start of the string.

Then `.*` matches any character 0+ times until the end of the string, and the engine backtracks until either kidney or renal fits.
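
To make that behaviour concrete, here is a small Python sketch that runs the attempted pattern against the three example strings from the question. The `re.IGNORECASE` flag and the names `ATTEMPT`, `EXAMPLES` and `attempt_results` are assumptions made for this illustration, since the question does not name a language or flags.

```python
import re

# The attempted pattern from the question. re.IGNORECASE is an assumption
# added here so that "Renal" (capitalised) is treated like "renal".
ATTEMPT = re.compile(r"(?<!carcinoma).*(kidney|renal)", re.IGNORECASE)

EXAMPLES = [
    "Renal cell carcinoma",
    "Clear cell carcinoma of kidney",
    "Chronic renal impairment",
]

# At index 0 nothing precedes the cursor, so the negative lookbehind
# succeeds and the rest of the pattern is free to find a keyword.
attempt_results = [bool(ATTEMPT.search(s)) for s in EXAMPLES]
print(attempt_results)  # [True, True, True]
```

All three strings match, including the two that contain carcinoma, so the lookbehind does not filter anything here.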


Instead of using `(?<!carcinoma)`, use `^(?!.*\bcarcinoma\b)` to assert from the start of the string that the word carcinoma does not occur anywhere to the right.

Then match either the word renal or kidney in the string.

^(?!.*\bcarcinoma\b).*\b(renal|kidney)\b.*

Regex demo
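
As a quick check outside the demo, the pattern can be tried on the question's three strings with a Python sketch like the one below. The helper name `find_term` and the `re.IGNORECASE` flag are assumptions for this illustration and are not part of the pattern itself.

```python
import re

# Pattern from the answer. re.IGNORECASE is an assumption so that
# capitalised words such as "Renal" or "Carcinoma" are also covered.
PATTERN = re.compile(
    r"^(?!.*\bcarcinoma\b).*\b(renal|kidney)\b.*",
    re.IGNORECASE,
)


def find_term(text):
    """Return the matched keyword, or None when the string contains
    the word "carcinoma" or has no keyword at all."""
    m = PATTERN.match(text)
    return m.group(1) if m else None


print(find_term("Renal cell carcinoma"))            # None
print(find_term("Clear cell carcinoma of kidney"))  # None
print(find_term("Chronic renal impairment"))        # renal
```

Note that because of the `\b` word boundaries, a string containing only the plural "carcinomas" would not be rejected. Dropping the trailing `\b` in the lookahead is one way to cover that case if it matters for your data.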

The fourth bird