Overlapping findall in python

Question

I was debugging some legacy code and found out that we didn't use re.findall correctly.

So I have a set of keywords(could be a phrase too), I need to return all the keywords the occur in a sentence.

keyWords = [keyword1, keyword2,...] # size around ~500
prog = re.compile(r'\b(%s)\b'%"|".join(keyWords)) # has to match the entire word, hence the word boundary \b
prog.findall(sentence)

But it didn't work in the following case:

myKeywords = [A, A B]
mySentence = [A B]

"findall" will only return A, because it's non-overlapping search.

Then I fell back to brutal force using re.search:

set(filter(lambda x: bool(re.search(r'\b(%s)\b'%x, sentence)), keyWords))

but the performance is way too slow. With around ~500 keywords and a less than 10 words sentence, the brutal force takes 10^-2 seconds while the findall only takes 10^-4 seconds. The regex compilation does take 10^-2 seconds, but with more than 1M sentences, it can be ignored.

Is there any built-in method or faster way to do this?

Second thought:

After further investigation, I think this has nothing to do with overlapping or non-overlapping search, meaning even with non-overlapping search, it won't help with issue. It's more a find all phrases(phrase can be a substring of another phrase) in a sentence problem.

You can use the `regex` module (`pip install regex`) if you want overlapping regex — user3483203, Jul 09 '18 at 21:13
Related, perhaps duplicate: https://stackoverflow.com/questions/5616822/python-regex-find-all-overlapping-matches — Robᵩ, Jul 09 '18 at 21:14
It depends what you call buit-in. There are implementation of aho-corasick algorithm in python that would give you a boost. [PyAhoCorasick on pypi](https://pypi.org/project/pyahocorasick/) by example. — Michael Doubez, Jul 09 '18 at 21:51
@Robᵩ I've tried that as well, doesn't work. Didn't fully understand it. But my guess is that the forward assertion with finditer will never look back? But in my case, you might have to go back to the beginning. — Weitao Wang, Jul 10 '18 at 14:24

score 0 · Answer 1 · answered Jul 09 '18 at 22:41

0

I think you want the longest matches.

To do that, you can sort your keywords by size in reverse order (longest first)

import re

keywords = ['a', 'b', 'a b']
keywords.sort(key=lambda k: len(k), reverse=True)

regex = r'\b{0}\b'.format('|'.join(keywords))
findall_kw = re.compile(regex).findall

For instance:

sentence = 'a, a b'
print(findall_kw(sentence))

You get:

['a', 'a b']

Note: if your keywords can contains special characters, you may consider using re.escape() to escape them.

answered Jul 09 '18 at 22:41

Laurent LAPORTE

21,958
6
58
103

I don't think this solve my problem. With the exact example, I'm actually expecting ['a', 'b', 'a b'], order doesn't matter. The 'b' should match because it's in your keyword list and it's surrounded by a space and at the end of the string, which satisfied the word boundary condition. – Weitao Wang Jul 10 '18 at 14:00

Overlapping findall in python

1 Answers1