You don't need to escape whitespace.
import re
text = "oracle sql"
regex = "(oracle sql|sql)"
print(re.findall(regex, text, re.I))  # prints ['oracle sql']
From the documentation:
Return all non-overlapping matches of pattern in string, as a list of
strings.
Here "sql" lies inside the text already consumed by the "oracle sql" match, so it counts as an overlapping match and is not returned.
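To see where the scan goes, here is a small sketch using re.finditer to print the span of each match. The variable names are only for illustration.

```python
import re

text = "oracle sql"
pattern = re.compile(r"oracle sql|sql", re.I)

# Each match consumes its characters; the next search resumes where it ended.
spans = [(m.group(0), m.span()) for m in pattern.finditer(text)]
print(spans)  # [('oracle sql', (0, 10))]
```

The single match covers the whole string (positions 0 to 10), so the scan never gets a chance to start at the "sql" inside it.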
Returning overlapping matches
You can wrap the pattern in a lookahead and capture inside it. The lookahead itself matches zero characters, so technically the matches do not overlap, while the capture group still returns the string you're looking for.
import re
text = "oracle sql"
regex = "(?=(oracle sql|sql))"
print(re.findall(regex, text, re.I))
Output:
['oracle sql', 'sql']
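A short sketch of why this works, again using re.finditer: the overall match is empty, and only the capture group spans text. The names here are illustrative only.

```python
import re

text = "oracle sql"
pattern = re.compile(r"(?=(oracle sql|sql))", re.I)

# m.span() is the zero-width lookahead match; m.span(1) is the captured text.
hits = [(m.group(1), m.span(), m.span(1)) for m in pattern.finditer(text)]
print(hits)
# [('oracle sql', (0, 0), (0, 10)), ('sql', (7, 7), (7, 10))]
```

The two matches sit at positions 0 and 7 and consume nothing, so the engine is free to report both even though the captured text overlaps.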
The downside of this implementation is that it finds at most one match per starting position in the string. At each position the alternation stops at the first alternative that matches, so any other item that starts at the same position is skipped.
For example, (my test|my|test) will only find ['my test', 'test'].
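A quick check of this limitation, assuming the input text is "my test" (the text is not stated here, so that is a guess). It also shows that the order of the alternatives decides which item wins at a given position.

```python
import re

text = "my test"  # assumed input, not given in the answer

# Longest alternative first: "my test" wins at position 0, so "my" is never reported.
longest_first = re.findall(r"(?=(my test|my|test))", text, re.I)
print(longest_first)  # ['my test', 'test']

# Shortest alternative first: now "my" wins at position 0 and "my test" is lost.
shortest_first = re.findall(r"(?=(my|my test|test))", text, re.I)
print(shortest_first)  # ['my', 'test']
```

Either way, only one of the items starting at position 0 comes back.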
You could also use a replacement for the re module that supports overlapping matches, such as regex, but with the pattern (my test|my|test) this will still only find ['my test', 'test']:
import regex as re
text = "oracle sql"
regex = "(oracle sql|sql)"
print(re.findall(regex, text, re.I, overlapped=True))
Recursion
A regex will only find one match per starting position. It has already found a match at the first character with "oracle sql", so you can't also get a match on just oracle there. A single pass can't find every item.
However, you could use a recursive function that tries to match the same string again with all of the items minus those that have already been matched.
I am not sure how performant this code will be as you could execute a lot of regex searches.
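import re

def find_all_matches(text, items):
    # items are assumed to be literal strings without regex metacharacters
    regex_items = '|'.join(items)
    regex = "(?=({}))".format(regex_items)
    matches = re.findall(regex, text, re.I)
    # Compare case-insensitively, since re.I can return text in a different case
    found = set(m.lower() for m in matches)
    new_items = [i for i in items if i.lower() not in found]
    # Recurse only if this pass matched something; otherwise items that never
    # occur in text would recurse forever
    if matches and new_items:
        return matches + find_all_matches(text, new_items)
    return matches

print(find_all_matches("oracle sql", ['oracle sql', 'oracle', 'sql']))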
Output:
['oracle sql', 'sql', 'oracle']
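To get a feel for how many searches this approach can run, here is an iterative sketch of the same idea that also counts the passes. The name find_all_matches_counted and the passes counter are made up for illustration, and re.escape assumes the items are literal strings rather than patterns.

```python
import re

def find_all_matches_counted(text, items):
    """Return (matches, passes): every matched item and the number of findall calls."""
    remaining = list(items)
    matches, passes = [], 0
    while remaining:
        # Assumption: items are literal strings, so escape them.
        pattern = "(?=({}))".format("|".join(re.escape(i) for i in remaining))
        found = re.findall(pattern, text, re.I)
        passes += 1
        if not found:
            break  # the remaining items do not occur in text at all
        matches.extend(found)
        seen = {m.lower() for m in found}
        remaining = [i for i in remaining if i.lower() not in seen]
    return matches, passes

print(find_all_matches_counted("oracle sql", ["oracle sql", "oracle", "sql"]))
# (['oracle sql', 'sql', 'oracle'], 2)
print(find_all_matches_counted("abcd", ["abcd", "abc", "ab", "a"]))
# (['abcd', 'abc', 'ab', 'a'], 4)
```

In the second call every item starts at the same position, so each pass can report only one of them and the number of searches grows with the number of items.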
No regex
Lastly, you could implement this without regex. Again, I haven't looked at the performance of this.
def find_all_matches(text, items):
    # Keep every item that occurs somewhere in text (case-sensitive)
    return [i for i in items if i in text]

print(find_all_matches("oracle sql", ['oracle sql', 'oracle', 'sql']))
Output:
['oracle sql', 'oracle', 'sql']
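One difference worth noting: the regex versions pass re.I, while the in operator is case-sensitive. A minimal sketch of a case-insensitive variant, assuming plain str.lower() is good enough for your text (the function name is hypothetical):

```python
def find_all_matches_ci(text, items):
    # Lowercase both sides once so the substring test ignores case.
    lowered = text.lower()
    return [i for i in items if i.lower() in lowered]

result = find_all_matches_ci("Oracle SQL", ["oracle sql", "oracle", "sql", "mysql"])
print(result)  # ['oracle sql', 'oracle', 'sql']
```

The items come back in the order and casing they were passed in, not as they appear in the text.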