I am compiling a regex pattern from a (long) list and then applying it to extract matches. The problem is that when an element of the list contains a special symbol, I cannot compile the list into a regex pattern. Can someone please shed some light on this?

For example, the code below works as long as the symbol ")" is not introduced. But my list will have many elements with different symbols.

import re

# Token list without special characters: this works
my_tokens = ["my test 1", "my test 2", "my test 3", "many large test n", "my test X"]
# Same list, but the last token ends with ")": the pattern no longer compiles
#my_tokens = ["my test 1", "my test 2", "my test 3", "many large test n", "my test )"]

# Match any listed token, or else a single word
reg = r'\b(%s|\w+)\b' % '|'.join(my_tokens)

my_test_sentence = "my test 1 and my test 3 and so on my test X"

for token in re.finditer(reg, my_test_sentence):
    print(token.group())

Thank you in advance!

Droid-Bird

1 Answer

Call re.escape() on your tokens to escape any characters that have special meaning.

reg = r'(?:^|\W)(%s|\w+)(?:\W|$)' % '|'.join(map(re.escape, my_tokens))

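You also can't use \b as the boundary around the matches, since there is no word boundary between ) and whitespace or the end of the string. Change these to match a non-word character or the string boundaries instead. Those boundary characters then become part of the overall match, so read the token itself from token.group(1) rather than token.group().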
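To make the two points above concrete, here is a minimal sketch, assuming the token list from the question with ")" as the last character of the final token. The names `escaped`, `with_b`, and `pattern` are only illustrative. It checks that the escaped token still matches its literal text, that `\b` fails right after ")", and that the pattern from this answer finds the token in group 1.

```python
import re

my_tokens = ["my test 1", "my test 2", "my test 3", "many large test n", "my test )"]

# re.escape() backslash-escapes characters that are special in a regex,
# so ")" is treated as a literal character instead of closing a group.
escaped = [re.escape(t) for t in my_tokens]
print(escaped[-1])

# With \b the ")" token cannot match at the end of the string:
# ")" is a non-word character and no word character follows it.
with_b = re.compile(r'\b(?:%s)\b' % '|'.join(escaped))
print(with_b.search("so on my test )"))

# The pattern from this answer: the token is captured in group 1,
# while the surrounding non-word characters belong to the full match.
pattern = re.compile(r'(?:^|\W)(%s|\w+)(?:\W|$)' % '|'.join(escaped))
found = [m.group(1) for m in pattern.finditer("so on my test )")]
print(found)
```

Note that "on" does not appear in `found`: the space before it was already consumed as the trailing boundary of the previous match, so words that directly follow a match can be skipped with this pattern.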

Barmar
  • Thanks @Barmar for your help. It seems you are right, but no luck here: it didn't work on the "my test )" token. – Droid-Bird Aug 05 '22 at 21:41
  • It should. If you `print(reg)` you should see `my test \)` – Barmar Aug 05 '22 at 21:43
  • The problem is due to `\b`, which matches a word boundary. A word boundary is between a word character and a non-word character. `)` is a non-word character, and there's no word character after it in your string. – Barmar Aug 05 '22 at 21:47
  • It does capture the idea you mentioned. However, `re.finditer(...)` does not produce the expected output: it does not print "my test )" if you run the code above. – Droid-Bird Aug 05 '22 at 21:48
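
One possible reason for the unexpected output discussed in these comments is that a pattern which consumes the surrounding non-word characters returns them as part of `token.group()` and can skip the word that follows each match. The sketch below is an alternative, not the pattern from the answer: it uses the lookarounds `(?<!\w)` and `(?!\w)`, which only assert that no word character is adjacent and consume nothing. The helper name `build_token_pattern` is hypothetical, and sorting the tokens longest-first is an added assumption so that a longer token is tried before any shorter one that is its prefix.

```python
import re

def build_token_pattern(tokens):
    """Hypothetical helper: match any token literally, or else a single word."""
    # Try longer alternatives first so a longer token beats a shorter prefix.
    alternatives = sorted(map(re.escape, tokens), key=len, reverse=True)
    # (?<!\w) and (?!\w) check the neighbouring characters without consuming them.
    return re.compile(r'(?<!\w)(?:%s|\w+)(?!\w)' % '|'.join(alternatives))

my_tokens = ["my test 1", "my test 2", "my test 3", "many large test n", "my test )"]
my_test_sentence = "my test 1 and my test 3 and so on my test )"

pattern = build_token_pattern(my_tokens)
matches = [m.group() for m in pattern.finditer(my_test_sentence)]
print(matches)
# ['my test 1', 'and', 'my test 3', 'and', 'so', 'on', 'my test )']
```

Because nothing outside the token is consumed, `m.group()` returns the token itself and no capture group is needed.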