Incorporating special characters in Python's re.compile

Question

Concise description

I am working on a project where I have a list of keywords (some contain special characters) and a string. I have to check whether any of the keywords are present in that string and extract them. The search is case insensitive, but the exact keyword has to be present: if SAP is a keyword, then sap is a positive hit while saphire is a negative hit.

I have put in a lot of effort, but so far I only get part of the output I am looking for.

Here is some sample code to illustrate:

>>> import re
>>> keywords = ["HIPAA", "ERP(2.0)"]
>>> r = re.compile('|'.join([r'\b%s\b' % w for w in keywords]), flags=re.I)
>>> word = "HIPAAA and ERP(2.0)"
>>> r.findall(word)
['']

Here I should be getting this output: ["ERP(2.0)"]

I have checked out this question: Escape regex special characters in a Python string, but it doesn't really answer my question.

Can anyone please guide me on how to make this work, considering I have tens of keywords that contain special characters, and I am importing those keywords from MySQL?

Detailed description

Test 1

>>> keywords = ["HIPAA", "ERP"]
>>> r = re.compile('|'.join([r'\b%s\b' % w for w in keywords]), flags=re.I)
>>> word = "HIPAA and ERP"
>>> r.findall(word)
['HIPAA', 'ERP']

Test 2

>>> keywords = ["HIPAA", "ERP(2.0)"]
>>> r = re.compile('|'.join([r'\b%s\b' % w for w in keywords]), flags=re.I)
>>> word = "HIPAA and ERP(2.0)"
>>> r.findall(word)
['']

Test 3

>>> keywords = ["HIPAA", "ERP\(2.0\)"]
>>> r = re.compile('|'.join([r'\b%s\b' % w for w in keywords]), flags=re.I)
>>> word = "HIPAA and ERP(2.0)"
>>> r.findall(word)
['HIPAA']

Test 4

>>> keywords = ["HIPAA", "ERP(2.0)"]
>>> r = re.compile('|'.join([r'\b%s\b' % re.escape(w) for w in keywords]), flags=re.I)
>>> word = r"HIPAASTOL and ERP(2.0)"
>>> r.findall(word)
[]

Test 5

>>> keywords = ["HIPAA", "ERP(2.0)"]
>>> r = re.compile('|'.join([re.escape(w) for w in keywords]), flags=re.I)
>>> word = r"HIPAASTOL and ERP(2.0)"
>>> r.findall(word)
['HIPAA', 'ERP(2.0)']

Thanks in advance :)

score 1 · Accepted Answer · answered Mar 11 '20 at 12:34

Special characters have to be escaped.
By definition, a word boundary \b is a zero-length assertion that matches a ... boundary between a word character \w or [a-zA-Z0-9_] and a non-word character \W or [^a-zA-Z0-9_].

In your case, you have the regex: \bHIPAA\b|\bERP(2.0)\b

There is no problem with the former, \bHIPAA\b, but the latter, \bERP(2.0)\b, has 2 problems:

The parens have to be escaped; otherwise (2.0) is a capturing group, not literal text.
The last word boundary requires a word character just after the closing paren.
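
The second point can be checked directly. This is a minimal sketch using only the standard re module and the sample string from the question; the variable names are illustrative, not part of the original code:

```python
import re

text = "HIPAA and ERP(2.0)"

# The parens and the dot are escaped, so the pattern is literal,
# but the trailing \b still needs a word character next to ")".
# Here ")" is followed by the end of the string, so \b cannot match.
with_boundary = re.compile(r"\bERP\(2\.0\)\b", flags=re.I)
print(with_boundary.findall(text))

# A word character right after ")" creates a boundary there, which is
# the opposite of what a whole-keyword search wants.
print(with_boundary.findall("ERP(2.0)x"))
```

The first call finds nothing and the second one finds the keyword, which is why \b is the wrong tool when a keyword ends in a non-word character.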

To escape special characters, you have to use the re.escape function:

re.escape(w) for w in keywords
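
To see what re.escape does to the keywords from the question, a small sketch (the names escaped and pattern are only for illustration):

```python
import re

keywords = ["HIPAA", "ERP(2.0)"]

# re.escape puts a backslash in front of regex metacharacters,
# so "(", "." and ")" are matched literally.
escaped = [re.escape(w) for w in keywords]
print(escaped)

# Escaping alone gives a valid pattern, but it does not yet
# restrict matches to whole keywords.
pattern = re.compile("|".join(escaped), flags=re.I)
print(pattern.findall("HIPAA and ERP(2.0)"))
```

Escaping solves the first problem only; the boundary problem still needs separate handling.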

To detect word boundaries without using \b, you have to assert that there is no word character before or after the keyword. For that, use lookarounds:

(?<!\w) before the keyword: a negative lookbehind that makes sure there is no word character before it.
(?!\w) after the keyword: a negative lookahead that makes sure there is no word character after it.

Your regex becomes:

r = re.compile('|'.join([r'(?<!\w)%s(?!\w)' % re.escape(w) for w in keywords]), flags=re.I)
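
Put together as a runnable sketch, checked against the strings from the question. The helper name build_keyword_regex is hypothetical and only wraps the line above:

```python
import re

def build_keyword_regex(keywords):
    # Hypothetical helper: escape each keyword, then require that no
    # word character sits directly before or after it.
    parts = [r"(?<!\w)%s(?!\w)" % re.escape(w) for w in keywords]
    return re.compile("|".join(parts), flags=re.I)

r = build_keyword_regex(["HIPAA", "ERP(2.0)"])

# "HIPAAA" is rejected because a word character follows "HIPAA".
print(r.findall("HIPAAA and ERP(2.0)"))

# The search is case insensitive.
print(r.findall("hipaa and erp(2.0)"))

# Both keywords are rejected when a word character follows them.
print(r.findall("HIPAASTOL and ERP(2.0)x"))
```

The first call returns only the ERP(2.0) keyword, which is the output the question asks for.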


moys · Answer 2 · 2020-03-03T05:48:18.627

0

This works

keywords = ["HIPAA", "ERP(2.0)"]
r = re.compile('|'.join([re.escape(w) for w in keywords]), flags=re.I)
word = r"HIPAA and ERP(2.0)"
r.findall(word)

output

['HIPAA', 'ERP(2.0)']

edited Mar 03 '20 at 05:48

answered Mar 03 '20 at 05:21

thanks for your answer, I have updated the question a bit, can you please have a look? – Sankar Mar 03 '20 at 05:42
If I am adding this condition `'\b%s\b'` , the output is not coming. updated the question as well, can you please check once? – Sankar Mar 11 '20 at 10:30
@Wiktor Stribiżew, can you please check this once? – Sankar Mar 11 '20 at 10:46
