
I need to find all occurrences of a list of words in a text using regex. For example, given the words:

words = {'i', 'me', 'my'}

and some

text = 'A book is on the table. I have a book on the table. My book is on the table. There is my book on the table.'

it should return result = ["I", "My", "my"]

I'm using this:

re.findall(r"'|'.join(words))", text,flags=re.IGNORECASE))

But it's returning an empty list.

Also if I use this:

re.findall(r"(?=("+'|'.join(words)+r"))", text, flags=re.IGNORECASE))

returns:

['i', 'I', 'My', 'i', 'i', 'my']

which is incorrect.

mkrieger1
Hossein
  • https://stackoverflow.com/questions/54481198/python-match-multiple-substrings-in-a-string – AMC Feb 08 '20 at 21:07
  • Please be more specific about what the issue is. Which part are you struggling with? – AMC Feb 08 '20 at 21:07

3 Answers


There is a problem in the way you define the regex. You are not joining the words: because the join call sits inside the string literal, the pattern is the literal text "'|'.join(words)", which leads to no matches.

>>> x = r"'|'.join(words)"
>>> x
"'|'.join(words)"

You can rewrite it as

>>> re.findall(r"\b({})\b".format('|'.join(words)), text, flags=re.IGNORECASE)
['I', 'My', 'my']

Note that \b here is a word boundary: it matches the empty string at the beginning or end of a word, which is needed in order to match only whole words.
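For reference, the approach above can be put together as a small self-contained script. This is only a sketch: the use of re.escape, the non-capturing group, and sorting the set are my additions (assumptions, not part of the answer), included so that words containing regex metacharacters are matched literally and the pattern is built in a stable order.

```python
import re

words = {'i', 'me', 'my'}
text = ('A book is on the table. I have a book on the table. '
        'My book is on the table. There is my book on the table.')

# Assumption: escape each word so characters such as "." or "+" are
# treated literally, and sort because sets have no defined order.
alternation = '|'.join(re.escape(w) for w in sorted(words))

# \b on both sides restricts matches to whole words.
pattern = re.compile(r'\b(?:{})\b'.format(alternation), flags=re.IGNORECASE)

matches = pattern.findall(text)
print(matches)  # ['I', 'My', 'my']
```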

abc
pattern = re.compile('|'.join(r'\b' + w + r'\b' for w in words),
                     flags=re.IGNORECASE)
pattern.findall(text)  # text is a string, so pass it directly

Putting \b on either side of words keeps "I" from matching things like "is".
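A quick way to see the effect is to search for a single word with and without the boundaries. The shortened sample sentence below is a hypothetical one made up for the illustration, not the text from the question.

```python
import re

# Hypothetical shortened sample sentence.
sample = 'A book is on the table. I have a book.'

# Without \b, the "i" inside "is" is matched as well.
without_boundaries = re.findall(r'i', sample, flags=re.IGNORECASE)

# With \b on both sides, only the standalone word "I" is matched.
with_boundaries = re.findall(r'\bi\b', sample, flags=re.IGNORECASE)

print(without_boundaries)  # ['i', 'I']
print(with_boundaries)     # ['I']
```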

Michael Lorton

This is how I would do it:

This regex picks out values from my list that are surrounded by non-word characters, e.g. the "I" in "Is it I?"

import re

words = ["I", "am", "my"]
text = "A book is on the table. I have a book on the table. My book is on the table. There is my book on the table."

pattern = r'\W.*?({})\W.*?'.format('|'.join(words))
s = re.findall(pattern, text, flags=re.IGNORECASE)
print(s)
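One thing that may be worth checking before relying on this pattern: the lazy .*? between \W and the group means the word does not have to start right after the non-word character, so the tail of a longer word can match too. The sketch below compares it with a \b-based pattern on a hypothetical sample sentence (the sentence and the variable names are made up for the comparison).

```python
import re

words = ["I", "am", "my"]
sample = "Oh hi there, it is I."  # hypothetical test sentence

loose = r'\W.*?({})\W.*?'.format('|'.join(words))
strict = r'\b({})\b'.format('|'.join(words))

# The loose pattern also picks up the "i" at the end of "hi".
loose_matches = re.findall(loose, sample, flags=re.IGNORECASE)

# Word boundaries only accept the standalone "I".
strict_matches = re.findall(strict, sample, flags=re.IGNORECASE)

print(loose_matches)   # ['i', 'I']
print(strict_matches)  # ['I']
```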
mkrieger1
Prayson W. Daniel