0

I was debugging some legacy code and found out that we didn't use re.findall correctly.

So I have a set of keywords(could be a phrase too), I need to return all the keywords the occur in a sentence.

keyWords = [keyword1, keyword2,...] # size around ~500
prog = re.compile(r'\b(%s)\b'%"|".join(keyWords)) # has to match the entire word, hence the word boundary \b
prog.findall(sentence)

But it didn't work in the following case:

myKeywords = [A, A B]
mySentence = [A B]

"findall" will only return A, because it's non-overlapping search.

Then I fell back to brutal force using re.search:

set(filter(lambda x: bool(re.search(r'\b(%s)\b'%x, sentence)), keyWords))

but the performance is way too slow. With around ~500 keywords and a less than 10 words sentence, the brutal force takes 10^-2 seconds while the findall only takes 10^-4 seconds. The regex compilation does take 10^-2 seconds, but with more than 1M sentences, it can be ignored.

Is there any built-in method or faster way to do this?

Second thought:

After further investigation, I think this has nothing to do with overlapping or non-overlapping search, meaning even with non-overlapping search, it won't help with issue. It's more a find all phrases(phrase can be a substring of another phrase) in a sentence problem.

Weitao Wang
  • 189
  • 1
  • 2
  • 11
  • 1
    You can use the `regex` module (`pip install regex`) if you want overlapping regex – user3483203 Jul 09 '18 at 21:13
  • Related, perhaps duplicate: https://stackoverflow.com/questions/5616822/python-regex-find-all-overlapping-matches – Robᵩ Jul 09 '18 at 21:14
  • It depends what you call buit-in. There are implementation of aho-corasick algorithm in python that would give you a boost. [PyAhoCorasick on pypi](https://pypi.org/project/pyahocorasick/) by example. – Michael Doubez Jul 09 '18 at 21:51
  • @Robᵩ I've tried that as well, doesn't work. Didn't fully understand it. But my guess is that the forward assertion with finditer will never look back? But in my case, you might have to go back to the beginning. – Weitao Wang Jul 10 '18 at 14:24

1 Answers1

0

I think you want the longest matches.

To do that, you can sort your keywords by size in reverse order (longest first)

import re

keywords = ['a', 'b', 'a b']
keywords.sort(key=lambda k: len(k), reverse=True)

regex = r'\b{0}\b'.format('|'.join(keywords))
findall_kw = re.compile(regex).findall

For instance:

sentence = 'a, a b'
print(findall_kw(sentence))

You get:

['a', 'a b']

Note: if your keywords can contains special characters, you may consider using re.escape() to escape them.

Laurent LAPORTE
  • 21,958
  • 6
  • 58
  • 103
  • I don't think this solve my problem. With the exact example, I'm actually expecting ['a', 'b', 'a b'], order doesn't matter. The 'b' should match because it's in your keyword list and it's surrounded by a space and at the end of the string, which satisfied the word boundary condition. – Weitao Wang Jul 10 '18 at 14:00