
Concise description

I am working on a project where I have a list of keywords (some containing special characters) and a string. I have to check whether any of the keywords are present in that string and extract them. The search must be case insensitive, but the exact keyword has to be present: if SAP is a keyword, then sap is a positive hit while saphire is a negative hit.

I have put in a lot of effort, but I could only get output that is partially what I am looking for.

Here is some sample code to illustrate:

>>> import re
>>> keywords = ["HIPAA", "ERP(2.0)"]
>>> r = re.compile('|'.join([r'\b%s\b' % w for w in keywords]), flags=re.I)
>>> word = "HIPAAA and ERP(2.0)"
>>> r.findall(word)
[]

Here I should be getting this output: ["ERP(2.0)"]

I have checked out the question "Escape regex special characters in a Python string", but it doesn't really answer my question.

Can anyone please guide me on how to make this work, considering that I have tens of keywords that contain special characters and that I am importing those keywords from MySQL?

Detailed description

Test 1

>>> keywords = ["HIPAA", "ERP"]
>>> r = re.compile('|'.join([r'\b%s\b' % w for w in keywords]), flags=re.I)
>>> word = "HIPAA and ERP"
>>> r.findall(word)
['HIPAA', 'ERP']

Test 2

>>> keywords = ["HIPAA", "ERP(2.0)"]
>>> r = re.compile('|'.join([r'\b%s\b' % w for w in keywords]), flags=re.I)
>>> word = "HIPAA and ERP(2.0)"
>>> r.findall(word)
['']

Test 3

>>> keywords = ["HIPAA", r"ERP\(2.0\)"]
>>> r = re.compile('|'.join([r'\b%s\b' % w for w in keywords]), flags=re.I)
>>> word = "HIPAA and ERP(2.0)"
>>> r.findall(word)
['HIPAA']

Test 4

>>> keywords = ["HIPAA", "ERP(2.0)"]
>>> r = re.compile('|'.join([r'\b%s\b' % re.escape(w) for w in keywords]), flags=re.I)
>>> word = r"HIPAASTOL and ERP(2.0)"
>>> r.findall(word)
[]

Test 5

>>> keywords = ["HIPAA", "ERP(2.0)"]
>>> r = re.compile('|'.join([re.escape(w) for w in keywords]), flags=re.I)
>>> word = r"HIPAASTOL and ERP(2.0)"
>>> r.findall(word)
['HIPAA', 'ERP(2.0)']

Thanks in advance :)

Sankar

2 Answers

  1. Special characters have to be escaped.
  2. By definition, a word boundary \b is a zero-length assertion that matches a ... boundary between a word character \w or [a-zA-Z0-9_] and a non-word character \W or [^a-zA-Z0-9_].

In your case, you have the regex: \bHIPAA\b|\bERP(2.0)\b

There is no problem with the former, \bHIPAA\b, but the latter, \bERP(2.0)\b, has 2 errors.

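  1. The parentheses have to be escaped; unescaped, they form a capturing group instead of matching literal parentheses.
  2. The last word boundary can only match if a word character comes just after the closing parenthesis, because ")" is itself a non-word character.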
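
Both problems can be reproduced in isolation. The following is a minimal sketch using only the standard re module; the sample text comes from the question, while the variable names and the extra strings "ERP2x0" and "ERP(2.0)x" are illustrative assumptions.

```python
import re

text = "HIPAA and ERP(2.0)"

# Problem 1: unescaped parentheses form a capturing group, so the pattern
# looks for "ERP" followed by "2", any character, and "0".
unescaped = re.compile(r"\bERP(2.0)\b", flags=re.I)
print(unescaped.findall(text))        # nothing found in this text
print(unescaped.findall("ERP2x0"))    # findall returns the group's content

# Problem 2: with the keyword escaped, the trailing \b still fails here,
# because ")" is a non-word character followed by the end of the string.
escaped_with_b = re.compile(r"\bERP\(2\.0\)\b", flags=re.I)
print(escaped_with_b.findall(text))

# The trailing \b only succeeds when a word character follows ")".
print(escaped_with_b.findall("ERP(2.0)x"))
```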

To escape special characters, use the re.escape function:

re.escape(w) for w in keywords
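
As a quick illustration of what re.escape does to the keywords from the question, here is a small sketch (the variable names are only for illustration):

```python
import re

keywords = ["HIPAA", "ERP(2.0)"]

# re.escape backslash-escapes regex metacharacters such as "(", ")" and ".",
# so that each keyword is matched literally.
escaped = [re.escape(w) for w in keywords]
for original, pattern in zip(keywords, escaped):
    print(original, "->", pattern)

# Each escaped pattern matches its own keyword exactly.
assert all(re.fullmatch(p, w) for p, w in zip(escaped, keywords))
```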

To detect word boundaries without using \b, you have to assert that there is no word character before or after the keyword. For that, use lookarounds:

  • (?<!\w) before the keyword: a negative lookbehind that makes sure there is no word character before it.
  • (?!\w) after the keyword: a negative lookahead that makes sure there is no word character after it.

Your regex becomes:

r = re.compile('|'.join([r'(?<!\w)%s(?!\w)' % re.escape(w) for w in keywords]), flags=re.I)
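
For reference, here is a self-contained sketch of the same idea wrapped in a helper. The function name build_keyword_regex is hypothetical, and two details are assumptions added for illustration rather than part of the regex above: the alternatives share one pair of lookarounds, and longer keywords are tried first so that a keyword that is a prefix of another cannot shadow it.

```python
import re

def build_keyword_regex(keywords):
    """Compile a case-insensitive pattern matching any keyword as a whole token."""
    # Assumption: try longer keywords first, so a keyword that is a prefix
    # of another (e.g. "ERP" and "ERP(2.0)") does not shadow the longer one.
    ordered = sorted(keywords, key=len, reverse=True)
    alternatives = "|".join(re.escape(w) for w in ordered)
    # (?<!\w) / (?!\w): no word character directly before / after the keyword.
    return re.compile(r"(?<!\w)(?:%s)(?!\w)" % alternatives, flags=re.I)

pattern = build_keyword_regex(["HIPAA", "ERP(2.0)", "SAP"])

print(pattern.findall("HIPAA and ERP(2.0)"))
print(pattern.findall("HIPAASTOL and erp(2.0)"))
print(pattern.findall("sap is a hit, saphire is not"))
```

Because the group is non-capturing, findall returns the matched text as it appears in the input string, so the original casing is preserved.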


Toto

This works

import re

keywords = ["HIPAA", "ERP(2.0)"]
r = re.compile('|'.join([re.escape(w) for w in keywords]), flags=re.I)
word = r"HIPAA and ERP(2.0)"
r.findall(word)

output

['HIPAA', 'ERP(2.0)']
moys