I thought I was good enough with regular expressions that I could read almost any of them, but this simple one (in Python) has me baffled. www.regexpal.com gives a different result than IPython.

import re

data = 'four year entrepreneurial program. Students develop and run a business, gain much needed ...'

m = re.compile(r'entrepreneur|business\s(plan|model)')

m.findall(data)

gives ['']

How can that be right? If I wrap the whole thing in parentheses, it works better but still returns an empty string as part of the match:

m = re.compile(r'(entrepreneur|business\s(plan|model))')

m.findall(data)

gives [('entrepreneur', '')]

As I said, the first one works on www.regexpal.com. I also tested it in Python (not iPython) and it fails there too.

smci
Brad
  • What do you expect to find, and why? – jgritty May 22 '14 at 22:30
  • Note that you should use `re.match` or `re.search` if you're comparing it with regexpal. `m.search(data).group()` -> `'entrepreneur'` – Ashwini Chaudhary May 22 '14 at 22:34
  • `.findall` is working as expected: *Return a list of all non-overlapping matches in the string.* Since in the first one `'entrepreneur'` is not in a group it is not returned by `.findall`. – Ashwini Chaudhary May 22 '14 at 22:38
  • I was expecting ["entrepreneur"], because I think of | as an "OR", and the first option matches and the second doesn't. I understand now the need for wrapping parens and the ?: for the plan|model group. – Brad May 24 '14 at 13:29
  • However, it's still very non-intuitive. I just tried m.match(data) and got nothing. I didn't understand why it doesn't match 'entrepreneur'. Sorry, I see now that .match() only tries to match at the start of the string. – Brad May 24 '14 at 13:33
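
The difference between the matching methods raised in the comments above can be checked with a short sketch. It reuses the question's data string with the non-capturing form of the pattern; the variable names are only illustrative.

```python
import re

data = 'four year entrepreneurial program. Students develop and run a business, gain much needed ...'
pattern = re.compile(r'entrepreneur|business\s(?:plan|model)')

# match() only succeeds if the pattern matches at position 0 of the string.
at_start = pattern.match(data)

# search() scans forward and returns the first match found anywhere.
anywhere = pattern.search(data)

# fullmatch() requires the pattern to cover the entire string.
whole = pattern.fullmatch(data)

print(at_start)            # None, because the string starts with 'four'
print(anywhere.group())    # entrepreneur
print(whole)               # None
```

Since regexpal highlights matches anywhere in the text, `search` (or `finditer`) is probably the closer equivalent.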

3 Answers

When a pattern contains capturing groups, findall returns the groups' values rather than the whole matched substring. Your pattern

entrepreneur|business\s(plan|model)

scans the data string until it finds a match. Once a match is found (here, the entrepreneur in entrepreneurial program...), findall records the value of the first group, which took no part in the match and is therefore reported as an empty string. It then continues scanning but finds no further matches, so the final result is a list with one empty string.
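
One way to see this is to look at each match object directly. The following is a minimal sketch using `re.finditer` on the question's data; the `rows` name is only for illustration.

```python
import re

data = 'four year entrepreneurial program. Students develop and run a business, gain much needed ...'
pattern = re.compile(r'entrepreneur|business\s(plan|model)')

# For every match, record the whole matched text (group 0) and group 1.
rows = [(m.group(0), m.group(1)) for m in pattern.finditer(data)]
print(rows)                    # [('entrepreneur', None)]

# findall reports only group 1, and shows the unmatched group as ''.
print(pattern.findall(data))   # ['']
```

The match itself is fine; findall just reports group 1, which did not participate, instead of the whole match.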

To observe behaviour similar to regexpal, parenthesize the whole expression and make the other groups non-capturing:

>>> re.findall(r'(entrepreneur|business\s(?:plan|model))', data)
['entrepreneur']
gog
  • thanks, ok, that makes sense. But why [('entrepreneur', '')] if I wrap the whole thing in parens? 'entrepreneur' matches the first, but nothing matches the second. So why the ''? – Brad May 22 '14 at 22:43
  • @Brad: a re engine cannot match "nothing". It's always something (provided the whole match succeeded). – gog May 22 '14 at 22:50

The issue is the parentheses. They create a capturing group, which with your example string is unmatched (the ungrouped entrepreneur part of the pattern matches instead). If there are any capturing groups in the pattern, re.findall returns the group results instead of the whole match: a string per match for a single group, or a tuple per match for several groups. An unmatched group is reported as an empty string, so that's why you're getting one. In the second version of your code, you have two groups: the first covers the whole pattern, while the second again covers only the plan|model part (which is not matched).

If you use a non-capturing group ((?:X)) for the plan|model alternation you'll probably get the results you expect (the text "entrepreneur"), as re.findall returns the whole matched text if there are no capturing groups.

Try: "entrepreneur|business\s(?:plan|model)"
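
As a quick check, the three variants discussed in this thread can be run side by side. This is a small sketch; the dictionary keys are just labels chosen for the example.

```python
import re

data = 'four year entrepreneurial program. Students develop and run a business, gain much needed ...'

patterns = {
    'original': r'entrepreneur|business\s(plan|model)',        # one capturing group
    'wrapped': r'(entrepreneur|business\s(plan|model))',       # two capturing groups
    'non_capturing': r'entrepreneur|business\s(?:plan|model)', # no capturing groups
}

# Run findall with each variant and show what comes back.
results = {name: re.findall(p, data) for name, p in patterns.items()}
for name, found in results.items():
    print(name, found)
# original ['']
# wrapped [('entrepreneur', '')]
# non_capturing ['entrepreneur']
```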

Blckknght

This is just how capturing groups work with findall.

re.findall(pattern, string, flags=0)

Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found. If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result unless they touch the beginning of another match.

You have a capturing group in the right hand side of your alternation, but the left hand side of the alternation matches your string.

entrepreneur|business\s(plan|model)

Thus, the group is empty since the left hand side matched, and that's what findall gives you.

To fix, make your group non-capturing:

entrepreneur|business\s(?:plan|model)

Now, there are no groups so findall returns what your main expression matched.
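
The difference is easier to see on text where both sides of the alternation match. The sample sentence below is made up for illustration and is not from the question.

```python
import re

# Hypothetical sample text in which both alternatives occur.
sample = 'an entrepreneur with a business plan and a business model'

capturing = re.findall(r'entrepreneur|business\s(plan|model)', sample)
non_capturing = re.findall(r'entrepreneur|business\s(?:plan|model)', sample)

# With a capturing group, findall reports only the group's content,
# which is '' whenever the left-hand alternative matched.
print(capturing)       # ['', 'plan', 'model']

# Without capturing groups, findall reports each whole match.
print(non_capturing)   # ['entrepreneur', 'business plan', 'business model']
```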

roippi