I am trying to extract a list of complete sentences from a body of plain text using a regular expression in Python 2.7. For my purposes, the list does not need to include everything that could be construed as a complete sentence, but everything in the list does need to be a complete sentence. The code below illustrates the issue:
import re
text = "Hello World! This is your captain speaking."
sentences = re.findall("[A-Z]\w+(\s+\w+[,;:-]?)*[.!?]", text)
print sentences
Per this regex tester, I should, in theory, be getting a list like this:
>>> ["Hello World!", "This is your captain speaking."]
But the output I am actually getting is like this:
>>> [' World', ' speaking']
The documentation indicates that findall scans the string from left to right and that the * and + operators are greedy, so I expected each match to cover a whole sentence. I'd appreciate any help.
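A minimal sketch of what may be going on, assuming the parenthesized group is the cause: the re documentation says that when the pattern contains a group, findall returns the group's contents instead of the whole match, and a repeated group keeps only its last iteration. The variable names `with_group` and `sentences` are just illustrative. Switching to a non-capturing group `(?:...)` appears to give the expected list:

```python
import re

text = "Hello World! This is your captain speaking."

# Capturing group: findall returns only what the group matched on its
# last repetition, not the full sentence.
with_group = re.findall(r"[A-Z]\w+(\s+\w+[,;:-]?)*[.!?]", text)
print(with_group)   # [' World', ' speaking']

# Non-capturing group (?:...): with no groups in the pattern, findall
# returns each full match.
sentences = re.findall(r"[A-Z]\w+(?:\s+\w+[,;:-]?)*[.!?]", text)
print(sentences)    # ['Hello World!', 'This is your captain speaking.']
```

The single-argument print(...) form is used so the snippet runs the same way on Python 2.7 and Python 3. This only addresses the return value of findall; whether the pattern is strict enough about what counts as a complete sentence is a separate question.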