Regex for combinational word matching using Python

Question

I am trying to find multiple word matches in a given text. For example:

text = "oracle sql"
regex = "(oracle\\ sql|sql)"
re.findall(regex,text,re.I)

Actual output

oracle sql

Expected output

oracle sql,sql

Can anyone tell me where the problem is with this regex?

Updated:

@jim It won't work when there are multiple overlapping matches. For example:

re.findall("(?=(spark|spark sql|sql))","spark sql",re.I)

Actual Output

['spark','sql']

Expected Output :

['spark','sql','spark sql']

Note: in the above case, when both individual words are matched, the combination of the words is not matched.

Updated :

Check link : repl.it/repls/NewFaithfulMath

Which version of python are you using? I'm getting `findall() got an unexpected keyword argument 'flag'` — Jim Wright, Aug 15 '18 at 16:43
Possible duplicate of [How to find overlapping matches with a regexp?](https://stackoverflow.com/questions/11430863/how-to-find-overlapping-matches-with-a-regexp) — Paolo, Aug 16 '18 at 08:08
@UnbearableLightness My main point is how to get the overlapping matched words as well, so how can it be a duplicate? Can you give this a try: re.findall("(?=(spark|spark sql|sql))","spark sql",re.I) — Arpit, Aug 16 '18 at 09:05
See [this answer](https://stackoverflow.com/a/18966698/3390419). — Paolo, Aug 16 '18 at 10:18
@UnbearableLightness see this link : repl.it/repls/NewFaithfulMath — Arpit, Aug 16 '18 at 10:26
@UnbearableLightness I have tried this as well, but it is not working with regex either — Arpit, Aug 16 '18 at 10:29

Jim Wright · Answer 1 · 2018-08-16T10:15:50.833

You don't need to escape whitespace.

import re
text = "oracle sql"
regex = "(oracle sql|sql)"
print(re.findall(regex, text, re.I))

From the documentation:

Return all non-overlapping matches of pattern in string, as a list of strings.

In your example "sql" lies inside "oracle sql", so it counts as an overlapping match and findall does not report it.
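A quick way to see the non-overlapping scan at work. This is only a minimal sketch; the variable names `first` and `swapped` are for illustration and are not part of the question:

```python
import re

text = "oracle sql"

# The scan starts at index 0, where "oracle sql" matches and consumes
# the whole string, so the inner "sql" is never reported.
first = re.findall(r"oracle sql|sql", text, re.I)
print(first)    # ['oracle sql']

# Swapping the alternatives does not help: "sql" cannot match at
# index 0, so the engine still takes "oracle sql" there.
swapped = re.findall(r"sql|oracle sql", text, re.I)
print(swapped)  # ['oracle sql']
```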

Returning overlapping matches

You can put the alternatives in a capturing group inside a lookahead. The lookahead itself matches zero characters, so the matches never overlap as far as findall is concerned, while the group still captures the text you're looking for.

import re
text = "oracle sql"
regex = "(?=(oracle sql|sql))"
print(re.findall(regex, text, re.I))

Output:

['oracle sql', 'sql']

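The downside of this approach is that it finds at most one match per starting position in the string. At each position the alternation stops at the first alternative that matches, so any other alternative starting at the same position is not reported.

For example, with the alternatives (my test|my|test) and the text "my test", it will only find ['my test', 'test']; "my" is never reported.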
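To make the one-match-per-position behaviour visible, the sketch below lists each lookahead match together with its start index. It assumes the text "my test"; the names `pattern` and `hits` are only illustrative:

```python
import re

text = "my test"
pattern = re.compile(r"(?=(my test|my|test))", re.I)

# Each lookahead match is zero-width, so m.start() is the position being
# tested and group(1) is the first alternative that fits at that position.
hits = [(m.start(), m.group(1)) for m in pattern.finditer(text)]
print(hits)  # [(0, 'my test'), (3, 'test')]
```

Position 0 is used up by "my test", which is why "my" does not appear in the result.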

You could also use a replacement for the re module that supports overlapping matches, such as the third-party regex module, but this will still only find ['my test', 'test'] with the pattern (my test|my|test):

import regex as re
text = "oracle sql"
regex = "(oracle sql|sql)"
print(re.findall(regex, text, re.I, overlapped=True))

Recursion

Regex will only find one match per starting position. It has already found the match starting at the first character ("oracle sql"), so you can't also get a match on just "oracle" there. A single pass can't find every item.

However, you could use a recursive function that matches the same string again with all of the items minus those that have already been matched.

I am not sure how performant this code will be as you could execute a lot of regex searches.

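import re

def find_all_matches(text, items):
  # Escape each item so regex metacharacters in it are matched literally.
  regex_items = '|'.join(re.escape(i) for i in items)
  regex = "(?=({}))".format(regex_items)
  matches = re.findall(regex, text, re.I)
  if not matches:
    # None of the remaining items occur in the text; stop so the recursion terminates.
    return []
  # Retry with the items not matched yet (compared case-insensitively, like re.I).
  found = {m.lower() for m in matches}
  new_items = [i for i in items if i.lower() not in found]
  if new_items:
    return matches + find_all_matches(text, new_items)
  return matches

print(find_all_matches("oracle sql", ['oracle sql', 'oracle', 'sql']))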

Output:

['oracle sql', 'sql', 'oracle']
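The same idea can also be written as a loop, which may be easier to follow than the recursion. The sketch below runs it on the "spark sql" example from the question. `find_all_matches_iter` is a hypothetical name, and the early exit when a pass finds nothing is an assumption about the desired behaviour for items that do not occur in the text:

```python
import re

# find_all_matches_iter is a hypothetical name, not from the original answer.
def find_all_matches_iter(text, items):
    remaining = list(items)
    found = []
    while remaining:
        # One lookahead pass over the text with the items still unmatched.
        pattern = "(?=({}))".format("|".join(re.escape(i) for i in remaining))
        matches = re.findall(pattern, text, re.I)
        if not matches:
            # Nothing left in the text matches; stop instead of looping forever.
            break
        found.extend(matches)
        # Drop the items matched in this pass (case-insensitive, like re.I).
        seen = {m.lower() for m in matches}
        remaining = [i for i in remaining if i.lower() not in seen]
    return found

result = find_all_matches_iter("spark sql", ["spark", "spark sql", "sql"])
print(result)  # ['spark', 'sql', 'spark sql']
```

The first pass reports "spark" and "sql"; the second pass only has "spark sql" left, so it can match at position 0.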

No regex

Lastly you could implement this without regex. Again I haven't looked at the performance of this.

def find_all_matches(text, items):
  return [i for i in items if i in text]

print(find_all_matches("oracle sql", ['oracle sql', 'oracle', 'sql']))

Output:

['oracle sql', 'oracle', 'sql']
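Note that the plain `in` test is case-sensitive and also matches inside longer words (for example "sql" inside "mysql"). If whole-word, case-insensitive matching is wanted, one possible sketch is to search for each item separately. `find_whole_word_matches` is a hypothetical name, and treating "whole word" as "not adjacent to another word character" is an assumption:

```python
import re

# find_whole_word_matches is a hypothetical helper, not from the original answer.
def find_whole_word_matches(text, items):
    hits = []
    for item in items:
        # Require that the item is not glued to another word character
        # on either side; re.escape keeps the item literal.
        pattern = r"(?<!\w)" + re.escape(item) + r"(?!\w)"
        if re.search(pattern, text, re.I):
            hits.append(item)
    return hits

whole = find_whole_word_matches("Oracle SQL and MySQL",
                                ["oracle sql", "oracle", "sql", "my"])
print(whole)  # ['oracle sql', 'oracle', 'sql']
```

Because each item gets its own search, one item matching never hides another, at the cost of one regex search per item.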

@jim This won't work if we use: re.findall("(?=(spark|spark sql|sql))","spark sql",re.I) — Arpit, Aug 15 '18 at 18:39
@jim Can you check why this is not working: https://repl.it/repls/NewFaithfulMath — Arpit, Aug 16 '18 at 08:50
@Arpit Regex will only find one match per character. It has already found the match for the first character based on "oracle sql". You can't find every single one. — Jim Wright, Aug 16 '18 at 10:06
@JimWright I did the same just now, but I didn't find any suitable regex for this — Arpit, Aug 16 '18 at 10:30
