Regular Expression overlap inclusion

Question

Suppose I would like to search for a bunch of tags in a string where some of the tags can be substrings of other tags. For example, I would like to search the tags ["UC", "UC Berkeley", "Berkeley"] in the text "He attended UC Berkeley last year." I would expect to get all three tags to show up. However, when I run this in Python, I only get "UC" and "Berkeley":

import re
string = "He attended UC Berkeley last year."
compiled_regexp = re.compile("UC|UC Berkeley|Berkeley", re.IGNORECASE)

re.findall(compiled_regexp, string)
# result is: ['UC', 'Berkeley']

How can I get all three tags to show up?

My actual use case involves tens of thousands of tags many of which are prefixes of other tags. There are also tags that are prefixes of other tags that are themselves prefixes of other tags and so on (like ["UC", "UCB", "UCBA" ...]) It would be infeasible to manually create capturing groups for all of the prefixes of other tags. Is there a better way to do this?

Update:
I've decided to do the following:
First, I find all tags that are prefixes of other tags. Then I build two separate regular expression, one for prefixing tags and another for non-prefixing tags. Finally I search the string with both regular expressions and combine the results.

Which version of Python are you using. I got my result using Python 3.7.2 — Jonathan, Jan 01 '19 at 21:09
Possible duplicate of [Python regex find all overlapping matches?](https://stackoverflow.com/questions/5616822/python-regex-find-all-overlapping-matches) — iz_, Jan 01 '19 at 21:13
@Jonathan I'm using Python 3.6.0. @Tomothy32 If I use the regex module and set the overlapped flag to true, I get the same results: `import regex as re string = "He attended UC Berkeley last year." compiled_regexp = re.compile("UC|UC Berkeley|Berkeley", re.IGNORECASE, overlapped=True) re.findall(compiled_regexp, string)` — pcsram, Jan 01 '19 at 21:18
@Tomothy32 I don't believe that this question is an exact duplicate of [Python regex find all overlapping matches?](https://stackoverflow.com/questions/5616822/python-regex-find-all-overlapping-matches). The current question concerns overlapping target substrings that start at the same position, whereas the other question does not. — Xukrao, Jan 01 '19 at 22:10
What should happen when a target string occurs multiple times in the to-be-searched string? Do you just need to know that the target string is present, or do you need to have a count of the occurrences? — Xukrao, Jan 01 '19 at 23:05
@Xukrao I just need to know that the target string is present. No need for count of occurrences. — pcsram, Jan 01 '19 at 23:28

Jonathan · Answer 1 · 2019-01-01T21:03:29.173

0

re.findall() does not support overlapping matches and 'UC' overlaps with 'UC Berkley' as well as the overlap between 'Berkley and 'UC Berkley'.

edited Jan 01 '19 at 21:03

answered Jan 01 '19 at 20:53

Jonathan

2,635
3
30
49

Incidentally I get a match of 'UC' alone when using your code. – Jonathan Jan 01 '19 at 20:54
Also, once a match is found it will not look for other possible matches. – Jonathan Jan 01 '19 at 21:02

Xukrao · Accepted Answer · 2019-01-02T16:25:34.720

0

Solution for small amount of target strings

If you have only a few target strings, then it's still feasible to manually contruct a regex pattern and perform a search like so:

import re
string = "He attended UC Berkeley last year."
compiled_regexp = re.compile(r"((UC) (Berkeley)|UC|Berkeley)", re.IGNORECASE)

matches = re.findall(compiled_regexp, string)
print(matches)

gives as output:

[('UC Berkeley', 'UC', 'Berkeley')]

For some more clarification on this regex pattern, see regex101.

General solution

I don't know of any easy way to use regex to search for a large amount of overlapping same-starting-position target strings (it appears that regex just isn't really designed for this scenario). However as long as your target strings are fixed, a list comprehension should be able to do the job:

string = "He attended UC Berkeley last year."
targets = ["UC Berkeley", "UC", "Berkeley"]
string_lower = string.lower()
found = [target for target in targets if target.lower() in string_lower]
print(found)

which gives as output:

['UC Berkeley', 'UC', 'Berkeley']

edited Jan 02 '19 at 16:25

answered Jan 01 '19 at 21:25

Xukrao

8,003
5
26
52

More simply: re.compile("((UC) (Berkeley))", re.IGNORECASE) – Ed Callahan Jan 01 '19 at 22:08
This works for this particular example. But what I'm actually trying to deal with is a large list (tens of thousands) of tags many of which are prefixes of others. I can't manually write grouped regular expressions for all of the prefixes (thousands). Is there a simpler way of doing this? – pcsram Jan 01 '19 at 22:22
Thanks @Xukrao Using a list comprehension would be pretty costly in my case since the search string is rather large and I'll be searching for tens of thousands of tags. So what I've opted to do instead is find all tags that are prefixes of other tags. Then I build two separate regular expression, one for prefixing tags and another for non-prefixing tags. Finally I search the string with both regular expressions and combine the results. – pcsram Jan 02 '19 at 06:23
@pcsram Out of curiosity: did you try the list comprehension solution on your full dataset? List comprehensions and `in` string membership tests are pretty well optimized python constructs. I think the execution will be faster than you might expect. Searching for 10000 target strings in a 13000 words text takes less than 1 second on my machine for example. – Xukrao Jan 02 '19 at 16:25

Regular Expression overlap inclusion

2 Answers2

Solution for small amount of target strings

General solution