
I have a list of keywords and want to extract every keyword that occurs in a document. One keyword may be a substring of another. I tried re.findall, but I get either the longer keyword or its substrings, never both. If 'A' and 'A B' are both keywords, I want to extract both.

Take one simplified case as an example:

The document is "A B C D" and the keywords are "A", "B" and "A B". The output depends on the order of the alternatives in my regex pattern:

string = "A B C D"
regex = r'A\ B|A|B'
re.findall(regex, string)
>>> ['A B']

string = "A B C D"
regex = r'A|B|A\ B'
re.findall(regex, string)
>>> ['A', 'B']

The expected output is

['A', 'B', 'A B']

Update: A similar post suggested using the third-party regex module with overlapped=True, which handles the overlapping example:

import regex as re
re.findall(r'A\ B|B\ C', 'A B C', overlapped=True)
>>> ['A B', 'B C']

However, that solution does not handle the case where one pattern is a substring of another:

import regex as re
re.findall(r'A\ B|A', 'A B C', overlapped=True)
>>> ['A B']

expected:

>>> ['A B', 'A']

PS: To be more specific, my regex pattern is like "(?<!\w)A\ B(?!\w)|(?<!\w)A(?!\w)", but I think the simplified case is clearer.

Trista
  • Does this answer your question? [How to find overlapping matches with a regexp?](https://stackoverflow.com/questions/11430863/how-to-find-overlapping-matches-with-a-regexp) – Nick Mar 14 '22 at 05:34
  • Or perhaps this: https://stackoverflow.com/questions/5616822/python-regex-find-all-overlapping-matches – Nick Mar 14 '22 at 05:35
  • @Nick Thanks for your reply! The situation is quite similar but it does not fully solve my case. For example, adding overlapped=True with the regex module works for `re.findall(r'A\ B|B\ C', 'A B C', overlapped=True)` but not for `re.findall(r'A|A\ B', 'A B C', overlapped=True)`. When one keyword is a substring of another keyword, this method cannot extract all the keywords. – Trista Mar 14 '22 at 08:45
  • Yes, that is true. I suggest you edit the question to make it clear as to why it is different from the duplicates I proposed. Then I'll remove my close vote. – Nick Mar 15 '22 at 04:05
  • Hi @Nick, I’ve updated the question. Thanks a lot for all the suggestions. – Trista Mar 15 '22 at 07:03
  • I think you probably will need to iterate over your keywords e.g. https://ideone.com/Fpxuz3 – Nick Mar 16 '22 at 23:51
  • Hi @Nick, yeah I think you are right. Could you make this comment as an answer so that I can accept the answer? Thanks for your help! – Trista Mar 24 '22 at 02:22

2 Answers


When one keyword is a substring of another, a single alternation cannot return both: at any given position in the string, the regex engine picks one alternative or the other, never both. Most modules, such as re, take the first alternative in the alternation that matches. To find all matches, iterate over the keywords and search for each one separately, using code like this:

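import re

string = "A B C D"
keys = ["A", "B", "A B"]

# Search for each keyword on its own so that a longer keyword
# cannot hide a shorter one that starts at the same position.
matches = []
for k in keys:
    # re.escape makes the keyword match literally
    matches += re.findall(re.escape(k), string)

print(matches)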

Output

['A', 'B', 'A B']

Demo on ideone
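
The same loop can be combined with the whole-word lookarounds from the question's PS, and the match positions can be kept as well. The following is a minimal sketch rather than a definitive implementation; the function name `find_keywords` and the (keyword, start, end) result format are assumptions made for illustration, and the keyword list is the one mentioned in the comments.

```python
import re

def find_keywords(text, keywords):
    """Return (keyword, start, end) for every whole-word occurrence of each keyword."""
    hits = []
    for kw in keywords:
        # Match the keyword literally and reject occurrences that have a
        # word character directly before or after them.
        pattern = r"(?<!\w)" + re.escape(kw) + r"(?!\w)"
        for m in re.finditer(pattern, text):
            hits.append((kw, m.start(), m.end()))
    return hits

hits = find_keywords("A B C D", ["A B C", "A B", "B C", "C D"])
print(hits)
# [('A B C', 0, 5), ('A B', 0, 3), ('B C', 2, 5), ('C D', 4, 7)]
```

Because each keyword gets its own pattern, the order of the keyword list only affects the order of the results, not which keywords are found.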

Nick

The pattern below finds 3 matches in the string " A B ".
The middle alternative looks for a space with a lookbehind for \bA and a lookahead for B\b. The problem is that this second match returns only the space, not the string A B.
You would then have to replace that matched space with A B yourself.

(\bA\b)|((?<=(\bA)) (?=B\b))|(\bB\b)
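
A quick way to see what this pattern returns is to run it with Python's re module. This is only a sketch: it uses finditer with group(0), because findall would return tuples of the capture groups, and the final step that maps the lone space back to "A B" is an assumed way of doing the replacement described above.

```python
import re

pattern = r"(\bA\b)|((?<=(\bA)) (?=B\b))|(\bB\b)"
text = " A B "

# Collect the full text and span of every match.
matches = [(m.group(0), m.span()) for m in re.finditer(pattern, text)]
print(matches)
# [('A', (1, 2)), (' ', (2, 3)), ('B', (3, 4))]

# Assumed post-processing: a bare space can only come from the middle
# alternative, so it stands for the keyword "A B".
keywords = ["A B" if s == " " else s for s, _ in matches]
print(keywords)
# ['A', 'A B', 'B']
```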
  • Hi @Kendle, yeah the pattern can extract all the matches. But I'm afraid it’s hard to expand to other patterns, like other keywords ‘A B C’, ‘A B’, ‘B C’, ‘C D’. I think it’s almost impossible to implement all the patterns manually or with a simple rule. – Trista Mar 15 '22 at 07:04