I'm filtering a column with a regular expression that checks whether any phrase from a list occurs in the text field:

phrase = ["email was deleted", "click on link", etc.]
df['text'].str.contains(r'\b(?:{})\b'.format('|'.join(sorted(phrase, key=len, reverse=True))), case=False, regex=True)

However, now I'd like to add a condition to exclude any results that are preceded by a list of phrases/words:

neg_phrases = ["did not", "not", "no"]

So I would expect a row containing "Steve told Mary the email was deleted" anywhere in the text to be in the output, but a row containing "Steve told Mary no email was deleted" should not be. I'm just having trouble working in the negative lookbehind.

chicagobeast12

1 Answer

Assuming there are no whitespace issues in your strings (no double spaces, and all spaces are regular \x20 spaces), you can use

pattern = r'\b(?<!{} )(?:{})\b'.format(' )(?<!'.join(neg_phrases),'|'.join(sorted(phrase, key=len, reverse=True)))

See the regex demo.

The \b(?<!did not )(?<!not )(?<!no )(?:email was deleted|click on link)\b pattern will only match email was deleted or click on link if it is not immediately preceded by did not, not or no followed by a space.
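
As a quick check of the above, here is a minimal sketch that builds the same pattern and runs it on the two sentences from the question. It uses plain re with the IGNORECASE flag instead of pandas, and the names lookbehinds, alternation and samples are only for illustration:

```python
import re

phrase = ["email was deleted", "click on link"]
neg_phrases = ["did not", "not", "no"]

# One fixed-width negative lookbehind per negating phrase, each ending in a space.
lookbehinds = "".join("(?<!{} )".format(neg) for neg in neg_phrases)
# Longest phrases first, as in the question.
alternation = "|".join(sorted(phrase, key=len, reverse=True))
pattern = r"\b{}(?:{})\b".format(lookbehinds, alternation)

rx = re.compile(pattern, re.IGNORECASE)

samples = [
    "Steve told Mary the email was deleted",  # should be kept
    "Steve told Mary no email was deleted",   # should be excluded
]
matches = [bool(rx.search(s)) for s in samples]
print(pattern)
print(matches)
```

The same pattern string can then be passed to df['text'].str.contains(pattern, case=False, regex=True) as in the question.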

You may also replace a literal space with \s to match any whitespace:

pattern = r'\b(?<!{}\s)(?:{})\b'.format(r'\s)(?<!'.join(neg_phrases), '|'.join(sorted(phrase, key=len, reverse=True)))
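
To illustrate the difference, the sketch below compares the literal-space and the \s variants on a tab-separated sentence, and also shows the double-space caveat: each lookbehind only looks back a fixed number of characters, so "no" followed by two spaces is not excluded by either variant. The sample strings and variable names are made up for the example:

```python
import re

phrase = ["email was deleted", "click on link"]
neg_phrases = ["did not", "not", "no"]
alternation = "|".join(sorted(phrase, key=len, reverse=True))

# Variant 1: the negating phrase must be followed by a literal space.
space_pattern = r"\b(?<!{} )(?:{})\b".format(" )(?<!".join(neg_phrases), alternation)
# Variant 2: the negating phrase may be followed by any single whitespace char.
ws_pattern = r"\b(?<!{}\s)(?:{})\b".format(r"\s)(?<!".join(neg_phrases), alternation)

text_tab = "Steve told Mary no\temail was deleted"
text_double = "Steve told Mary no  email was deleted"  # two spaces after "no"

# The tab is not a literal space, so only the \s variant excludes this row.
tab_with_space = bool(re.search(space_pattern, text_tab, re.I))
tab_with_ws = bool(re.search(ws_pattern, text_tab, re.I))
# Fixed-width lookbehinds cannot see past the second space, so this still matches.
double_with_ws = bool(re.search(ws_pattern, text_double, re.I))

print(tab_with_space, tab_with_ws, double_with_ws)
```

If repeated whitespace can occur in your data, one option is to normalize it first (for example with df['text'].str.replace(r'\s+', ' ', regex=True)) before filtering.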

If your phrases can contain special characters, escape them with re.escape (this needs import re): replace sorted(phrase, key=len, reverse=True) with sorted(map(re.escape, phrase), key=len, reverse=True), and replace the word boundaries with adaptive dynamic word boundaries, which only require a boundary where the phrase starts or ends with a word character:

pattern = r'(?!\B\w)(?<!{}\s)(?:{})(?<!\w\B)'.format(r'\s)(?<!'.join(map(re.escape, neg_phrases)), '|'.join(sorted(map(re.escape, phrase), key=len, reverse=True)))
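
A small sketch of why the adaptive boundaries matter, using a hypothetical phrase that ends with a parenthesis (it is not from the question). With a plain \b after the closing parenthesis the phrase is not found when a space follows it, while the adaptive version finds it and still respects the negative lookbehinds:

```python
import re

# "click on link (now)" is a made-up phrase ending in a non-word character.
phrase = ["click on link (now)", "email was deleted"]
neg_phrases = ["did not", "not", "no"]

escaped_neg = list(map(re.escape, neg_phrases))
alternation = "|".join(sorted(map(re.escape, phrase), key=len, reverse=True))

# Plain word boundaries: the trailing \b needs a word char on one side of it.
plain = r"\b(?<!{}\s)(?:{})\b".format(r"\s)(?<!".join(escaped_neg), alternation)
# Adaptive boundaries: only enforced where the phrase starts/ends with a word char.
adaptive = r"(?!\B\w)(?<!{}\s)(?:{})(?<!\w\B)".format(
    r"\s)(?<!".join(escaped_neg), alternation
)

text = "Please click on link (now) to confirm"
negated = "Please do not click on link (now) to confirm"

plain_hit = bool(re.search(plain, text, re.I))
adaptive_hit = bool(re.search(adaptive, text, re.I))
adaptive_negated_hit = bool(re.search(adaptive, negated, re.I))

print(plain_hit, adaptive_hit, adaptive_negated_hit)
```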
Wiktor Stribiżew
    Will give it a shot... thank you for the extensive explanation, makes sense. Turns out I had the negative lookbehind right, just botched the format. Don't think my phrase list will include any special characters, so the second solution should work for my use case. Thanks again! – chicagobeast12 Feb 18 '22 at 14:58