4

I have a script that gives me sentences that contain one of a specified list of key words. A sentence is defined as anything between 2 periods.

Now I want it to select the whole of a sentence like 'Put 1.5 grams of powder in': if powder were a key word, it should return the whole sentence and not '5 grams of powder'.

I am trying to figure out how to express that a sentence lies between two occurrences of a period followed by a space. My new filter is:

def iterphrases(text):
    return ifilter(None, imap(lambda m: m.group(1), finditer(r'([^\.\s]+)', text)))

However, now I no longer get any sentences, just pieces/phrases of words (including my key word). I am very confused as to what I am doing wrong.

Bhargav Rao
Jacob Ian
  • Just making sure you know, that logic won't work on sentences like `Nice to meet you. You can call me Mr. Smith.` – Hoopdady Jan 14 '15 at 15:42
  • _"A sentence is defined as anything between 2 periods."_ Wouldn't this exclude the first sentence in a string? For example, in your post, "I have a script that gives me sentences that contain one of a specified list of key words" isn't between two periods. – Kevin Jan 14 '15 at 15:44
  • @Kevin and the last sentence (as the delimiter is a period followed by a space). – Alex Jan 14 '15 at 15:45
  • You could try something like `"[[.!?] [A-Z]"`, but even that can get some wrong results (as in Hoopdady's example). IIRC, Emacs used the convention of "two spaces after sentence" to recognize the end of a sentence. – tobias_k Jan 14 '15 at 15:46
  • I know my documents won't have anything like Mr. Smith, due to the nature of the documents, so that's alright. However, I can't change the convention of my documents. I'm new to regular expressions: does [[.!?] [A-Z] mean a period, exclamation mark, or question mark, then a space, then any capital letter? Because that would mean it would mess up on sentences that begin with a number, correct? – Jacob Ian Jan 14 '15 at 15:50

3 Answers

3

If you don't HAVE to use an iterator, re.split would be a bit simpler for your use case (a custom definition of a sentence):

re.split(r'\.\s', text)

Note that the last element will still end with its period, or will be an empty string if the text ends with whitespace after the last period. To fix that, strip the final period before splitting:

re.split(r'\.\s', re.sub(r'\.\s*$', '', text))
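As a rough sketch of how this split could feed the key word filter from the question: the helper names `split_sentences` and `sentences_with_keywords` and the sample text are made up for illustration and are not part of the original script. The substring match on lower-cased text is also just an assumption about how key words should be compared.

```python
import re

def split_sentences(text):
    # Strip the final period (plus trailing whitespace), then split on
    # "period followed by one whitespace character".
    return re.split(r'\.\s', re.sub(r'\.\s*$', '', text))

def sentences_with_keywords(text, keywords):
    # Hypothetical helper: keep sentences containing any key word
    # (case-insensitive substring match).
    lowered = [k.lower() for k in keywords]
    return [s for s in split_sentences(text)
            if any(k in s.lower() for k in lowered)]

text = "Weigh the sample. Put 1.5 grams of powder in the flask. Stir well."
print(sentences_with_keywords(text, ["powder"]))
# ['Put 1.5 grams of powder in the flask']
```

The period in `1.5` is followed by a digit rather than whitespace, so it does not trigger a split.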

Also have a look at the somewhat more general case in the answers to "Python - RegEx for splitting text into sentences (sentence-tokenizing)".

For a completely general solution you would need a proper sentence tokenizer, such as the one in nltk.tokenize:

nltk.tokenize.sent_tokenize(text)
Community
Aprillion
2

Here you get it as an iterator. It works with my test cases. It considers a sentence to be anything (matched non-greedily) up to a period that is followed by either a space or the end of the line.

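import re
# Raw string so the backslashes reach the regex engine unchanged.
sentence = re.compile(r"\w.*?\.(?= |$)", re.MULTILINE)
def iterphrases(text):
    # Lazily yield each matched sentence, including its closing period.
    return (match.group(0) for match in sentence.finditer(text))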
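A small usage sketch, assuming a made-up sample text that contains a newline and a simple substring test for the key word (neither is from the original post):

```python
import re

# The pattern from this answer, written as a raw string.
sentence = re.compile(r"\w.*?\.(?= |$)", re.MULTILINE)

def iterphrases(text):
    return (match.group(0) for match in sentence.finditer(text))

# Illustrative input: the second sentence ends at a line break.
text = "Weigh the sample. Put 1.5 grams of powder in.\nStir well."

found = [s for s in iterphrases(text) if "powder" in s]
print(found)
# ['Put 1.5 grams of powder in.']
```

The period in `1.5` is skipped because the lookahead requires a space or the end of a line after it. Note that `.` does not match a newline by default, so a sentence that wraps across lines would not be matched as one piece.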
L3viathan
  • What does EOL mean? (Sorry, I'm trained as a chemist, so my computer science vocabulary has gaps.) – Jacob Ian Jan 14 '15 at 16:32
  • by `trim` do you mean `strip`? – Aprillion Jan 14 '15 at 16:33
  • Aprillion: Yes, sorry. Jacob: EOL = End Of Line. – L3viathan Jan 14 '15 at 16:41
  • Just to help my understanding: `.*?` means match any character 0 or more times, `\.` means a period, and `( |$)` means space OR new line, correct? – Jacob Ian Jan 14 '15 at 16:46
  • Correct (almost. $ is end-of-line, even without a newline character.)! The question mark makes the * act non-greedy, it is by default greedy, which would mean it grabs the biggest chunk of text this applies to, so probably your complete text. – L3viathan Jan 14 '15 at 16:50
  • Changed it to a positive lookahead, and made a sentence start with at least one alphanumeric character. That means unnecessary spaces shouldn't be a problem anymore. The positive lookahead `(?= |$)` means: make this whole match valid only if it is followed by a space or the end of the line, *but* don't match the space. – L3viathan Jan 14 '15 at 16:55
  • @L3viathan `re.compile` is not using [`re.MULTILINE`](https://docs.python.org/2/library/re.html#re.MULTILINE) mode by default, you need to add the corresponding flag yourself if `$` is supposed to match end of lines – Aprillion Jan 14 '15 at 16:55
  • Thanks. I expected the text to not actually contain newlines and used $ for the end of the text, but I guess it can't hurt. – L3viathan Jan 14 '15 at 16:57
  • Yeah my actual documents actually do have new lines in them so that's helpful to know. thanks – Jacob Ian Jan 14 '15 at 18:14
0

If you are sure that . is used for nothing besides delimiting sentences and that every relevant sentence ends with a period, then the following may be useful:

import re
matches = re.finditer(r'([^.]*?(powder|keyword2|keyword3).*?)\.', text)
result = [m.group() for m in matches]
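To illustrate why the assumption about periods matters, here is a small sketch with two made-up inputs (the sample strings and variable names are illustrative only). It uses `group(1)` and `strip()` to drop the closing period and leading space, which is a presentation choice and not part of the answer above.

```python
import re

pattern = re.compile(r'([^.]*?(powder|keyword2|keyword3).*?)\.')

# Periods are used only as sentence delimiters: the full sentence is found.
plain = "Weigh the sample. Add the powder slowly. Stir well."
plain_hits = [m.group(1).strip() for m in pattern.finditer(plain)]
print(plain_hits)
# ['Add the powder slowly']

# A decimal point breaks the assumption: the match starts after "1."
decimal = "Put 1.5 grams of powder in."
decimal_hits = [m.group(1).strip() for m in pattern.finditer(decimal)]
print(decimal_hits)
# ['5 grams of powder in']
```

The second result is the truncated phrase described in the question, so this pattern only fits text without decimal numbers or abbreviations.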
Alex