I was debugging some legacy code and found that we were not using re.findall correctly.
I have a set of keywords (a keyword can also be a multi-word phrase), and I need to return all the keywords that occur in a sentence.
import re
keyWords = ["keyword1", "keyword2", ...]  # around 500 entries
# \b on both sides so that only whole words (or whole phrases) match
prog = re.compile(r'\b(%s)\b' % "|".join(keyWords))
prog.findall(sentence)
But it does not work in the following case:
myKeywords = ["A", "A B"]
mySentence = "A B"
findall returns only "A", because it searches for non-overlapping matches.
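A minimal reproduction of the case above, as a sketch. The names `keywords`, `sentence`, `pattern` and `found` are only for illustration, and the `re.escape` call is my assumption that keywords should be matched literally:

```python
import re

keywords = ["A", "A B"]
sentence = "A B"

# Escape each keyword so it is treated as literal text, then join
# everything into one alternation wrapped in word boundaries.
pattern = re.compile(r"\b(%s)\b" % "|".join(re.escape(k) for k in keywords))

# The alternation tries "A" first, succeeds at position 0, and the scan
# resumes after that match, so "A B" is never reported.
found = pattern.findall(sentence)
print(found)  # ['A']
```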
Then I fell back to brute force, running re.search once per keyword:
set(filter(lambda x: bool(re.search(r'\b(%s)\b'%x, sentence)), keyWords))
But this is far too slow. With around 500 keywords and a sentence of fewer than 10 words, the brute-force version takes about 10^-2 seconds per sentence, while findall takes only about 10^-4 seconds. Compiling the combined regex does take about 10^-2 seconds, but with more than 1M sentences that one-time cost can be ignored.
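For reference, here is the brute-force idea written so that each keyword pattern is compiled only once up front. This is just a sketch: `compile_keywords` and `find_keywords` are hypothetical helper names, `re.escape` is again my assumption, and I have not measured whether it changes the timings above.

```python
import re

def compile_keywords(keywords):
    # One whole-word pattern per keyword, built once and reused for
    # every sentence.
    return {k: re.compile(r"\b%s\b" % re.escape(k)) for k in keywords}

def find_keywords(compiled, sentence):
    # Each keyword is tested independently, so a keyword that is a
    # prefix of another keyword is still reported.
    return {k for k, p in compiled.items() if p.search(sentence)}

compiled = compile_keywords(["A", "A B", "C"])
result = find_keywords(compiled, "A B")
print(sorted(result))  # ['A', 'A B']
```

This still does one search per keyword per sentence, so it stays linear in the number of keywords.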
Is there any built-in method or faster way to do this?
On second thought:
After further investigation, I think this has nothing to do with overlapping versus non-overlapping search: even an overlapping search would not solve the issue. It is really a problem of finding all phrases in a sentence, where one phrase can be a substring of another.
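To illustrate that last point, here is a sketch that uses a lookahead to get overlapping matches (`overlapping_findall` is a hypothetical helper, not something from the legacy code). It handles phrases that start at different positions, but at any single start position the alternation still yields only one alternative, so "A" and "A B" are never both returned:

```python
import re

def overlapping_findall(keywords, sentence):
    # The lookahead consumes no characters, so a match may begin at
    # every position, including inside an earlier match.
    alternation = "|".join(re.escape(k) for k in keywords)
    pattern = re.compile(r"(?=\b(%s)\b)" % alternation)
    return pattern.findall(sentence)

sentence = "A B"
short_first = overlapping_findall(["A", "A B"], sentence)
long_first = overlapping_findall(["A B", "A"], sentence)
different_start = overlapping_findall(["A B", "B"], sentence)

print(short_first)      # ['A']
print(long_first)       # ['A B']
print(different_start)  # ['A B', 'B']
```

Reordering the keywords only changes which of the two phrases wins at position 0.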