re.findall() isn't as greedy as expected - Python 2.7

Question

I am attempting to pull a list of complete sentences out of a body of plaintext using a regular expression in Python 2.7. For my purposes, it is not important that everything that could be construed as a complete sentence ends up in the list, but everything in the list does need to be a complete sentence. The code below illustrates the issue:

import re
text = "Hello World! This is your captain speaking."
sentences = re.findall("[A-Z]\w+(\s+\w+[,;:-]?)*[.!?]", text)
print sentences

Per this regex tester, I should, in theory, be getting a list like this:

>>> ["Hello World!", "This is your captain speaking."]

But the output I am actually getting is like this:

>>> [' World', ' speaking']

The documentation indicates that findall searches from left to right and that the * and + operators are greedy, so I expected the full sentences to be returned. I appreciate the help.

When you use capture groups with re.findall, it returns only the captures, not the whole match. Change your capture group `(...)` to a non-capturing group `(?:...)` (and the first `\w+` to `\w*`). Your problem has nothing to do with greediness. — Casimir et Hippolyte, May 06 '17 at 21:28
This is not an exact duplicate of http://stackoverflow.com/questions/31915018/python-re-findall-behaves-weird . In that question, there was a confounding issue of double escape ``\\`` inside a raw-string. This question more clearly gets to the heart of a single issue, the behavior of re.findall() when given capturing groups. — Raymond Hettinger, May 06 '17 at 22:53

Raymond Hettinger · Answer 1 · 2017-05-06T21:41:56.807

The issue is that findall() is showing just the captured subgroups rather than the full match. Per the docs for re.findall():

If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group.
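To make that rule concrete, here is a small sketch of how the return value changes with the number of capturing groups. The three patterns are simplified ones made up for illustration (they are not the pattern from the question), and the single-argument print() calls are written so the snippet should run on both Python 2.7 and Python 3.

```python
import re

text = "Hello World! This is your captain speaking."

# No capturing groups: findall returns the full text of each match.
no_groups = re.findall(r"[A-Z]\w+", text)

# One capturing group: findall returns only what that group captured.
one_group = re.findall(r"([A-Z])\w+", text)

# Two capturing groups: findall returns one tuple per match.
two_groups = re.findall(r"([A-Z])(\w+)", text)

print(no_groups)   # ['Hello', 'World', 'This']
print(one_group)   # ['H', 'W', 'T']
print(two_groups)  # [('H', 'ello'), ('W', 'orld'), ('T', 'his')]
```

When a group is repeated with * or +, as in the question, only the last repetition is kept, which is why ' World' and ' speaking' show up.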

It is easy to see what is going on using re.finditer() and exploring the match objects:

>>> import re
>>> text = "Hello World! This is your captain speaking."

>>> it = re.finditer("[A-Z]\w+(\s+\w+[,;:-]?)*[.!?]", text)

>>> mo = next(it)
>>> mo.group(0)
'Hello World!'
>>> mo.groups()
(' World',)

>>> mo = next(it)
>>> mo.group(0)
'This is your captain speaking.'
>>> mo.groups()
(' speaking',)

The solution to your problem is to suppress the subgroups with ?:. Then you get the expected results:

>>> re.findall("[A-Z]\w+(?:\s+\w+[,;:-]?)*[.!?]", text)
['Hello World!', 'This is your captain speaking.']
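If you would rather keep the capturing group (for example because you need it elsewhere), one alternative is to collect group(0) from re.finditer() yourself. The sketch below also tries the tweak suggested in the comments, changing the first \w+ to \w*, on a made-up sample sentence ("I am here.") that starts with a one-letter word. The variable names and that sample sentence are my own, not from the question.

```python
import re

text = "Hello World! This is your captain speaking."

# Capturing group left exactly as in the question.
capturing = r"[A-Z]\w+(\s+\w+[,;:-]?)*[.!?]"

# group(0) is always the whole match, regardless of capture groups.
sentences = [mo.group(0) for mo in re.finditer(capturing, text)]
print(sentences)  # ['Hello World!', 'This is your captain speaking.']

# Hypothetical sample: a sentence whose first word is a single letter.
short_start = "I am here."

# \w+ after [A-Z] demands at least two characters in the first word.
strict = re.findall(r"[A-Z]\w+(?:\s+\w+[,;:-]?)*[.!?]", short_start)

# \w* also accepts a one-letter first word.
relaxed = re.findall(r"[A-Z]\w*(?:\s+\w+[,;:-]?)*[.!?]", short_start)

print(strict)   # []
print(relaxed)  # ['I am here.']
```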

score 0 · Answer 2 · answered May 08 '17 at 01:00

You can change your regex somewhat:

>>> re.findall(r"[A-Z][\w\s]+[!.,;:]", text)
['Hello World!', 'This is your captain speaking.']
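One thing worth checking before adopting this shorter pattern: its final character class includes , ; and : and leaves out ?, so on other input it may stop at a comma. Here is a quick comparison against the non-capturing version of the question's pattern, using a sample string I made up for the purpose:

```python
import re

# Hypothetical sample text with a comma and a question mark.
text = "This is fine, really. Is it?"

# Shorter pattern: ',' counts as a terminator and '?' does not.
simplified = re.findall(r"[A-Z][\w\s]+[!.,;:]", text)

# Question's pattern with the group made non-capturing.
non_capturing = re.findall(r"[A-Z]\w+(?:\s+\w+[,;:-]?)*[.!?]", text)

print(simplified)     # ['This is fine,']
print(non_capturing)  # ['This is fine, really.', 'Is it?']
```

Since the question requires every item in the list to be a complete sentence, the difference may matter depending on your text.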

dawg
