Suppose I would like to search for a bunch of tags in a string, where some of the tags can be substrings of other tags. For example, I would like to search for the tags ["UC", "UC Berkeley", "Berkeley"] in the text "He attended UC Berkeley last year." I would expect all three tags to show up. However, when I run this in Python, I only get "UC" and "Berkeley":
import re
string = "He attended UC Berkeley last year."
compiled_regexp = re.compile("UC|UC Berkeley|Berkeley", re.IGNORECASE)
re.findall(compiled_regexp, string)
# result is: ['UC', 'Berkeley']
How can I get all three tags to show up?
My actual use case involves tens of thousands of tags, many of which are prefixes of other tags. Some tags are prefixes of tags that are themselves prefixes of other tags, and so on (like ["UC", "UCB", "UCBA" ...]). It would be infeasible to manually create capturing groups for all of the prefixes of other tags. Is there a better way to do this?
Update:
I've decided to do the following:
First, I find all tags that are prefixes of other tags. Then I build two separate regular expressions, one for the prefixing tags and another for the non-prefixing tags. Finally, I search the string with both regular expressions and combine the results.
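For reference, here is a minimal sketch of the approach described above. The function names (split_prefix_tags, build_pattern, find_tags) are made up for illustration, and putting the longest alternative first in each pattern is my own assumption, not something the approach requires. Note that this only separates prefix overlaps: in the original example "Berkeley" is a suffix of "UC Berkeley", not a prefix of another tag, so it lands in the same pattern as "UC Berkeley" and is still not reported. A chain of prefixes such as "UC", "UCB", "UCBA" would likewise put "UC" and "UCB" in the same pattern.

```python
import re

def split_prefix_tags(tags):
    """Split tags into (tags that are a prefix of another tag, all other tags)."""
    # After a case-insensitive sort, a tag that is a prefix of any other tag
    # is also a prefix of the tag immediately following it.
    ordered = sorted(set(tags), key=str.lower)
    prefixing, others = [], []
    for current, following in zip(ordered, ordered[1:] + [None]):
        if following is not None and following.lower().startswith(current.lower()):
            prefixing.append(current)
        else:
            others.append(current)
    return prefixing, others

def build_pattern(tags):
    """Compile one alternation over the given tags (assumed non-empty)."""
    # Assumption: longest tag first, so the longest tag wins at a given position.
    alternatives = sorted(tags, key=len, reverse=True)
    return re.compile("|".join(re.escape(t) for t in alternatives), re.IGNORECASE)

def find_tags(tags, text):
    """Search the text with both patterns and concatenate the matches."""
    found = []
    for group in split_prefix_tags(tags):
        if group:  # skip an empty group; an empty pattern would match everywhere
            found.extend(build_pattern(group).findall(text))
    return found

tags = ["UC", "UC Berkeley", "Berkeley"]
print(find_tags(tags, "He attended UC Berkeley last year."))
# prints: ['UC', 'UC Berkeley']
```

With tags ["UC", "UCB", "Berkeley"] and the text "UCB is in Berkeley", the same function returns ['UC', 'UCB', 'Berkeley'], since there the only overlap is a single prefix relationship.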