I am attempting to pull a list of complete sentences out of a body of plaintext using a regular expression in Python 2.7. For my purposes, the list does not need to contain every string that could be construed as a complete sentence, but everything in the list must be a complete sentence. The code below illustrates the issue:

import re
text = "Hello World! This is your captain speaking."
sentences = re.findall("[A-Z]\w+(\s+\w+[,;:-]?)*[.!?]", text)
print sentences

Per this regex tester, I should, in theory, be getting a list like this:

>>> ["Hello World!", "This is your captain speaking."]

But the output I am actually getting is like this:

>>> [' World', ' speaking']

The documentation indicates that findall searches from left to right and that the * and + operators are greedy. I'd appreciate any help.

  • When you use capture groups with re.findall, it returns only the captured groups, not the whole match. Change your capture group `(...)` to a non-capturing group `(?:...)` *(and the first `\w+` to `\w*`)*. Your problem has nothing to do with greediness. – Casimir et Hippolyte May 06 '17 at 21:28
  • Yep, that worked. Thanks a bundle. – Lee Richards May 06 '17 at 21:40
  • This is not an exact duplicate of http://stackoverflow.com/questions/31915018/python-re-findall-behaves-weird . In that question, there was a confounding issue of double escape ``\\`` inside a raw-string. This question more clearly gets to the heart of a single issue, the behavior of re.findall() when given capturing groups. – Raymond Hettinger May 06 '17 at 22:53

2 Answers

The issue is that findall() is showing just the captured subgroups rather than the full match. Per the docs for re.findall():

If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group.
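
To make that rule concrete, here is a minimal sketch of how the shape of the result changes with the number of capturing groups. The three small patterns are made up for illustration and are not from the question; only the sample text is.

```python
import re

text = "Hello World! This is your captain speaking."

# No capturing groups: each item is the whole match.
no_groups = re.findall(r"[A-Z]\w+", text)

# One capturing group: each item is only that group's text.
one_group = re.findall(r"([A-Z])\w+", text)

# Two capturing groups: each item is a tuple holding both groups.
two_groups = re.findall(r"([A-Z])(\w+)", text)

print(no_groups)   # ['Hello', 'World', 'This']
print(one_group)   # ['H', 'W', 'T']
print(two_groups)  # [('H', 'ello'), ('W', 'orld'), ('T', 'his')]
```

The regex engine matches the same spans of text in all three calls; only what findall() chooses to report differs.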

It is easy to see what is going on using re.finditer() and exploring the match objects:

>>> import re
>>> text = "Hello World! This is your captain speaking."

>>> it = re.finditer("[A-Z]\w+(\s+\w+[,;:-]?)*[.!?]", text)

>>> mo = next(it)
>>> mo.group(0)
'Hello World!'
>>> mo.groups()
(' World',)

>>> mo = next(it)
>>> mo.group(0)
'This is your captain speaking.'
>>> mo.groups()
(' speaking',)

The solution to your problem is to make the group non-capturing with (?:...). Then you get the expected results:

>>> re.findall("[A-Z]\w+(?:\s+\w+[,;:-]?)*[.!?]", text)
['Hello World!', 'This is your captain speaking.']
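
If the capture group has to stay (say, because other code reads it), one alternative is to build the list from the match objects instead of using findall(). This is only a sketch of that option; the variable names `pattern` and `sentences` are mine, and the pattern is the one from the question written as a raw string.

```python
import re

text = "Hello World! This is your captain speaking."

# Same pattern as in the question, capture group left in place.
pattern = re.compile(r"[A-Z]\w+(\s+\w+[,;:-]?)*[.!?]")

# group(0) is always the whole match, whatever groups the pattern contains.
sentences = [m.group(0) for m in pattern.finditer(text)]

print(sentences)  # ['Hello World!', 'This is your captain speaking.']
```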
Raymond Hettinger

You can change your regex somewhat:

>>> re.findall(r"[A-Z][\w\s]+[!.,;:]", text)
['Hello World!', 'This is your captain speaking.']
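
One thing that may be worth checking before adopting this: the final character class also accepts `,`, `;` and `:`, so a match can stop at a comma. The sketch below compares it with the non-capturing pattern from the other answer on a sample text I made up; the names `loose` and `strict` are just labels for this comparison.

```python
import re

# Hypothetical sample text with a comma after the first word.
text = "Well, hello there. This is fine."

# Pattern from this answer: may end a match at ',', ';' or ':'.
loose = re.findall(r"[A-Z][\w\s]+[!.,;:]", text)

# Non-capturing version of the question's pattern: must end at '.', '!' or '?'.
strict = re.findall(r"[A-Z]\w+(?:\s+\w+[,;:-]?)*[.!?]", text)

print(loose)   # ['Well,', 'This is fine.']
print(strict)  # ['This is fine.']
```

On this input the first pattern returns the fragment 'Well,', while the second skips the first sentence entirely. Since the question asks that everything in the list be a complete sentence, skipping is probably the safer failure mode.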
dawg