I have a list of keywords and want to extract every keyword that occurs in a document. One keyword may be a substring of another. I tried re.findall, but I get either the longer keyword or its substrings, never both. If 'A' and 'A B' are both keywords, I want to extract both.
Take a simplified case as an example. The document is "A B C D" and the keywords are "A", "B", and "A B". The output of my regex patterns is:
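import re

# Longest alternative first: only the longer keyword is returned
string = "A B C D"
regex = r'A\ B|A|B'
re.findall(regex, string)
>>> ['A B']

# Shorter alternatives first: only the substrings are returned
string = "A B C D"
regex = r'A|B|A\ B'
re.findall(regex, string)
>>> ['A', 'B']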
The expected output is
['A', 'B', 'A B']
Updated: A similar post suggested using the newer Python regex module, which supports overlapping matches:
import regex as re
re.findall(r'A\ B|B\ C', 'A B C', overlapped=True)
>>> ['A B', 'B C']
However, that solution does not handle the case where one pattern is a substring of another:
import regex as re
re.findall(r'A\ B|A', 'A B C', overlapped=True)
>>> ['A B']
expected:
>>> ['A B', 'A']
PS: To be more specific, my actual regex pattern is "(?<!\w)A\ B(?!\w)|(?<!\w)A(?!\w)", but I think the simplified case is clearer.
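
For reference, here is a minimal sketch of a fallback that produces the expected output. It assumes it is acceptable to search for each keyword with its own pattern instead of one combined alternation, and that only the presence of a keyword matters, not how many times it occurs. `find_keywords` is a hypothetical helper name, not part of `re` or `regex`; the lookarounds are the same ones as in my real pattern.

```python
import re

def find_keywords(keywords, text):
    """Return each keyword that occurs in text as a whole word or phrase.

    Hypothetical helper: one search per keyword, so a longer alternative
    can never consume the text that a shorter keyword would match.
    """
    found = []
    for kw in keywords:
        # Escape the keyword and require non-word characters (or the
        # string edges) on both sides.
        pattern = r'(?<!\w)' + re.escape(kw) + r'(?!\w)'
        if re.search(pattern, text):
            found.append(kw)
    return found

print(find_keywords(['A', 'B', 'A B'], 'A B C D'))
# ['A', 'B', 'A B']
```

This costs one pass over the document per keyword, so it may be too slow for a long keyword list, which is why I would still prefer a single-pattern solution if one exists.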