
I am trying to find multiple, overlapping word matches in a given text. For example:

text = "oracle sql"
regex = "(oracle\\ sql|sql)"
re.findall(regex,text,re.I)

Actual output:

['oracle sql']

Expected output:

['oracle sql', 'sql']

Can anyone tell me what is wrong with the regex?

Updated:

@jim That won't work when several alternatives overlap at the same position. For example:

re.findall("(?=(spark|spark sql|sql))","spark sql",re.I)

Actual output:

['spark', 'sql']

Expected output:

['spark', 'sql', 'spark sql']

Note: in the above case, when the individual words both match, the combination of the words is not matched.

Updated:

See this link: repl.it/repls/NewFaithfulMath

Arpit

1 Answer


You don't need to escape whitespace in the pattern (unless you use the re.VERBOSE flag).

import re
text = "oracle sql"
regex = "(oracle sql|sql)"
print(re.findall(regex, text, re.I))

From the documentation:

Return all non-overlapping matches of pattern in string, as a list of strings.

In your example, "sql" lies inside the text already consumed by "oracle sql", so it would be an overlapping match and is not returned.
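To see why only one result comes back, it can help to print the span of each match. The following is a small sketch using the standard re.finditer; the helper name match_spans is made up for illustration.

```python
import re

def match_spans(pattern, text):
    """Return (start, end, matched_text) for each non-overlapping match."""
    return [(m.start(), m.end(), m.group(0))
            for m in re.finditer(pattern, text, re.I)]

# The first alternative consumes characters 0-10, so the scan resumes at
# index 10 (end of string) and never sees the "sql" that starts at index 7.
print(match_spans("oracle sql|sql", "oracle sql"))
# [(0, 10, 'oracle sql')]
```

Searching for "sql" on its own reports the span (7, 10), which falls entirely inside the span already taken by "oracle sql".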

Returning overlapping matches

You can put the pattern inside a lookahead with a capture group. A lookahead is zero-width: it consumes no characters, so the matches themselves never overlap even though the captured text does.

import re
text = "oracle sql"
regex = "(?=(oracle sql|sql))"
print(re.findall(regex, text, re.I))

Output:

['oracle sql', 'sql']


The downside of this approach is that it finds at most one match per starting position in the string, because the alternation stops at the first alternative that succeeds there.

For example, the alternation (my test|my|test) applied to "my test" will only find ['my test', 'test']; "my" is never reported.
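Because alternatives are tried left to right at each position, the order in which they are written decides which one wins. A quick sketch to check this with the standard re module (the variable names are only illustrative):

```python
import re

text = "my test"

# Longest alternative first: "my test" wins at position 0, "my" is shadowed.
longest_first = re.findall("(?=(my test|my|test))", text, re.I)

# Shortest alternative first: "my" wins at position 0, "my test" is shadowed.
shortest_first = re.findall("(?=(my|my test|test))", text, re.I)

print(longest_first)   # ['my test', 'test']
print(shortest_first)  # ['my', 'test']
```

Either way, one of the two alternatives that start at position 0 is lost, so reordering alone cannot return all three.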

You could also use a drop-in replacement for re that supports overlapping matches, such as the third-party regex module, but with the pattern (my test|my|test) it will still only find ['my test', 'test']:

import regex as re
text = "oracle sql"
regex = "(oracle sql|sql)"
print(re.findall(regex, text, re.I, overlapped=True))

Recursion

A regex finds at most one match per starting position. It has already matched "oracle sql" at the first character, so it can't also return a match for just "oracle" there. A single pass can't find every item.

However, you could use a recursive function that searches the same string again with all of the items minus those that have already been matched.

I am not sure how well this performs, as it can execute a lot of regex searches.

import re

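def find_all_matches(text, items):
  if not items:
    return []
  # Lookahead so matches may overlap; escape items so they match literally.
  regex = "(?=({}))".format('|'.join(re.escape(i) for i in items))
  matches = re.findall(regex, text, re.I)
  if not matches:
    return []  # none of the remaining items occur: stop recursing
  # Compare case-insensitively, because re.I returns the text's own casing.
  found = {m.lower() for m in matches}
  # Retry with the items that were shadowed by an earlier alternative.
  new_items = [i for i in items if i.lower() not in found]
  return matches + find_all_matches(text, new_items)

print(find_all_matches("oracle sql", ['oracle sql', 'oracle', 'sql']))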

Output:

['oracle sql', 'sql', 'oracle']
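If recursion feels heavy, a simpler alternative might be to run one lookahead search per item, so that no item can shadow another. This is only a sketch under that assumption, not a tuned implementation; the function name find_each_item is hypothetical, and it returns start positions as well so that repeated occurrences can be told apart. It uses the "spark sql" example from the question.

```python
import re

def find_each_item(text, items):
    """Return (start_index, matched_text) for every occurrence of every item."""
    found = []
    for item in items:
        # re.escape makes the item match literally; the lookahead allows
        # occurrences of the same item to overlap each other.
        pattern = "(?=({}))".format(re.escape(item))
        for m in re.finditer(pattern, text, re.I):
            found.append((m.start(), m.group(1)))
    return sorted(found)

print(find_each_item("spark sql", ["spark", "spark sql", "sql"]))
# [(0, 'spark'), (0, 'spark sql'), (6, 'sql')]
```

This performs exactly one scan of the text per item, which keeps the cost predictable.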

No regex

Lastly, you could implement this without regex. Again, I haven't looked at the performance of this. Note that the in operator is case-sensitive, unlike the re.I searches above.

def find_all_matches(text, items):
  return [i for i in items if i in text]

print(find_all_matches("oracle sql", ['oracle sql', 'oracle', 'sql']))

Output:

['oracle sql', 'oracle', 'sql']
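
One caveat with the plain substring test is that it also matches inside longer words (for example "sql" inside "mysql"). If that matters, a variant along these lines might help. It is only a sketch: the function name find_whole_words is hypothetical, and it assumes every item begins and ends with a word character so that \b behaves as expected.

```python
import re

def find_whole_words(text, items):
    """Return the items that occur in text as whole words, ignoring case."""
    return [item for item in items
            if re.search(r"\b{}\b".format(re.escape(item)), text, re.I)]

print(find_whole_words("Oracle SQL and mysql", ["oracle sql", "oracle", "sql", "my"]))
# ['oracle sql', 'oracle', 'sql']
```

Here "my" is rejected because it only appears as part of "mysql", while "sql" is accepted because of the standalone "SQL".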
Jim Wright