Regular expression in Python sentence extractor

Question

I have a script that gives me sentences that contain one of a specified list of key words. A sentence is defined as anything between 2 periods.

Now I want it to return all of a sentence like 'Put 1.5 grams of powder in': if powder is a key word, it should get the whole sentence and not just '5 grams of powder'.

I am trying to figure out how to express that a sentence lies between two occurrences of a period followed by a space. My new filter is:

def iterphrases(text):
    return ifilter(None, imap(lambda m: m.group(1), finditer(r'([^\.\s]+)', text)))

However, now I no longer print any sentences, just pieces of words and phrases (including my key word). I am very confused as to what I am doing wrong.
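To see what the new pattern actually matches, it can be run on the example sentence. This is only an illustrative sketch: it uses `re.findall` in place of the `ifilter`/`imap` pipeline, and the sample text is an assumption based on the example above.

```python
import re

# Sample text (assumed for illustration, built from the example in the question).
sample = "Put 1.5 grams of powder in. Stir well."

# [^\.\s]+ matches one or more characters that are neither a period nor
# whitespace, so every match stops at the first space or period it reaches.
pieces = re.findall(r'([^\.\s]+)', sample)
print(pieces)
# ['Put', '1', '5', 'grams', 'of', 'powder', 'in', 'Stir', 'well']
```

Each match is a single word (or half of a number), which is why no whole sentence comes back.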

Just making sure you know, that logic won't work on sentences like `Nice to meet you. You can call me Mr. Smith.` — Hoopdady, Jan 14 '15 at 15:42
_"A sentence is defined as anything between 2 periods."_ Wouldn't this exclude the first sentence in a string? For example, in your post, "I have a script that gives me sentences that contain one of a specified list of key words" isn't between two periods. — Kevin, Jan 14 '15 at 15:44
@Kevin and the last sentence (as the delimiter is a period followed by a space). — Alex, Jan 14 '15 at 15:45
You could try sth. like `"[[.!?] [A-Z]"`, but even that can get some wrong results (as in Hoopdady's example). IIRC, Emacs used the convention of "two spaces after sentence" to recognize the end of a sentence. — tobias_k, Jan 14 '15 at 15:46
I know my documents won't have anything like Mr. Smith, due to the nature of the documents, so that's alright. However, I can't change the convention of my documents. I'm new at regular expressions: does [[.!?] [A-Z] mean period, exclamation or question mark, then any letter? Because that would mean it would mess up on sentences beginning with a number, correct? – Jacob Ian, Jan 14 '15 at 15:50

score 3 · Accepted Answer · edited May 23 '17 at 11:50

If you don't HAVE to use an iterator, re.split would be a bit simpler for your use case (custom definition of a sentence):

re.split(r'\.\s', text)

Note that the last sentence will still include its trailing . or, if the text ends with whitespace after the last period, the result will end with an empty string. To fix that:

re.split(r'\.\s', re.sub(r'\.\s*$', '', text))
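As a rough sketch of how the two calls above could be combined with the keyword filter from the question: the helper names `split_sentences` and `sentences_with_keywords` and the sample text are hypothetical, and keywords are matched as plain substrings, which may or may not suit the real documents.

```python
import re

def split_sentences(text):
    # Drop the final period (and any trailing whitespace), then split on
    # every "period followed by one whitespace character".
    trimmed = re.sub(r'\.\s*$', '', text)
    return re.split(r'\.\s', trimmed)

def sentences_with_keywords(text, keywords):
    # Hypothetical filter: keep sentences containing any keyword as a substring.
    return [s for s in split_sentences(text) if any(k in s for k in keywords)]

# Sample text is an assumption for illustration.
sample = "Put 1.5 grams of powder in. Stir well.\nAdd water. "
print(split_sentences(sample))
# ['Put 1.5 grams of powder in', 'Stir well', 'Add water']
print(sentences_with_keywords(sample, ["powder"]))
# ['Put 1.5 grams of powder in']
```

The period in "1.5" is left alone because it is not followed by whitespace.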

Also have a look at the slightly more general case in the answer to "Python - RegEx for splitting text into sentences (sentence-tokenizing)".

And for a completely general solution you would need a proper sentence tokenizer, such as nltk.tokenize:

nltk.tokenize.sent_tokenize(text)

score 2 · Answer 2 · L3viathan · 2015-01-14T16:52:06.110

Here you get it as an iterator. Works with my testcases. It considers a sentence to be anything (non-greedy) until a period, which is followed by either a space or the end of the line.

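import re
# A sentence starts at a word character and runs (non-greedily) up to a period
# that is followed by a space or the end of a line ($ with re.MULTILINE).
sentence = re.compile(r"\w.*?\.(?= |$)", re.MULTILINE)
def iterphrases(text):
    # Lazily yield each matched sentence, trailing period included.
    return (match.group(0) for match in sentence.finditer(text))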
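A possible way to use this iterator for the keyword search from the question is sketched below. The helper `iter_keyword_sentences`, the substring matching, and the sample text are assumptions added for illustration, not part of the answer.

```python
import re

# Same pattern as in the answer, written as a raw string.
sentence = re.compile(r"\w.*?\.(?= |$)", re.MULTILINE)

def iterphrases(text):
    return (match.group(0) for match in sentence.finditer(text))

def iter_keyword_sentences(text, keywords):
    # Hypothetical helper: lazily keep sentences containing any keyword.
    return (s for s in iterphrases(text) if any(k in s for k in keywords))

# Sample text is an assumption for illustration.
sample = "Put 1.5 grams of powder in. Stir well.\nAdd water."
found = list(iter_keyword_sentences(sample, ["powder", "water"]))
print(found)
# ['Put 1.5 grams of powder in.', 'Add water.']
```

Because `.` does not match a newline by default, a sentence that wraps across two lines would not be matched by this pattern.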

edited Jan 14 '15 at 16:52

answered Jan 14 '15 at 16:27


What does EOL mean? (sorry I'm trained to be a chemist my vocabulary for computer science has gaps) – Jacob Ian Jan 14 '15 at 16:32
by `trim` do you mean `strip`? – Aprillion Jan 14 '15 at 16:33
Aprillion: Yes, sorry. Jacob: EOL = End Of Line. – L3viathan Jan 14 '15 at 16:41
Just to help my understanding .*? means match any character 0 or more times. \. means a period and ( |$) means space OR new line correct? – Jacob Ian Jan 14 '15 at 16:46
Correct (almost. $ is end-of-line, even without a newline character.)! The question mark makes the * act non-greedy, it is by default greedy, which would mean it grabs the biggest chunk of text this applies to, so probably your complete text. – L3viathan Jan 14 '15 at 16:50
Changed it to a positive lookahead, and made a sentence start with at least one alphanumeric character. That means unnecessary spaces shouldn't be a problem anymore. The positive lookahead `(?= |$)` means: Make this whole match only valid, if it is followed by a space or the end of the line, *but* don't match the space. – L3viathan Jan 14 '15 at 16:55
@L3viathan `re.compile` is not using [`re.MULTILINE`](https://docs.python.org/2/library/re.html#re.MULTILINE) mode by default, you need to add the corresponding flag yourself if `$` is supposed to match end of lines – Aprillion Jan 14 '15 at 16:55
Thanks. I expected the text to not actually contain newlines and used $ for the end of the text, but I guess it can't hurt. – L3viathan Jan 14 '15 at 16:57
Yeah my actual documents actually do have new lines in them so that's helpful to know. thanks – Jacob Ian Jan 14 '15 at 18:14
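The effect of `re.MULTILINE` discussed in these comments can be checked with a small experiment. This is a sketch with made-up sample text; it only compares the answer's pattern with and without the flag.

```python
import re

pattern = r"\w.*?\.(?= |$)"

# Sample text (assumed): two sentences separated by a newline, no space.
sample = "First line.\nSecond line."

# Without the flag, $ only matches at the very end of the text, so the
# period before the newline does not count as a sentence end.
without_flag = re.findall(pattern, sample)

# With re.MULTILINE, $ also matches just before each newline.
with_flag = re.findall(pattern, sample, re.MULTILINE)

print(without_flag)  # ['Second line.']
print(with_flag)     # ['First line.', 'Second line.']
```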

score 0 · Answer 3 · answered Jan 14 '15 at 16:45

If you are sure that . is used for nothing besides sentence delimiters and that every relevant sentence ends with a period, then the following may be useful:

import re
# Raw string so that \. reaches the regex engine as an escaped period.
matches = re.finditer(r'([^.]*?(powder|keyword2|keyword3).*?)\.', text)
result = [m.group() for m in matches]
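The caveat about periods matters in practice. The sketch below runs the same pattern on two made-up sample texts; the helper name `keyword_sentences` and the `.strip()` call (which removes the space left over from the previous sentence) are additions for illustration.

```python
import re

pattern = re.compile(r'([^.]*?(powder|keyword2|keyword3).*?)\.')

def keyword_sentences(text):
    # Hypothetical helper: strip the leading space that the match picks up
    # from the gap after the previous sentence.
    return [m.group().strip() for m in pattern.finditer(text)]

# Sample texts are assumptions for illustration.
plain = "Stir well. Add the powder slowly. Let it rest."
decimal = "Put 1.5 grams of powder in. Stir well."

print(keyword_sentences(plain))    # ['Add the powder slowly.']
print(keyword_sentences(decimal))  # ['5 grams of powder in.']
```

The second result shows that a decimal point is treated as a sentence boundary, so this approach only fits texts where that cannot happen.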
