I have some text made up of sentences, some of which are questions. I'm trying to create a regular expression which will extract only the questions that contain a specific phrase, namely 'NSF':

import re
s = "This is a string. Is this a question? This isn't a question about NSF. Is this one about NSF? This one is a question about NSF but is it longer?"

Ideally, the re.findall would return:

['Is this one about NSF?','This one is a question about NSF but is it longer?']

but my current best attempt is:

re.findall('([\.\?].*?NSF.*\?)+?',s)
[". Is this a question? This isn't a question about NSF. Is this one about NSF? This one is a question about NSF but is it longer?"]

I know I need to do something with non-greedy-ness, but I'm not sure where I'm messing up.

Erik
  • Try `r'\s*([^.?]*?NSF[^.?]*?[?])'` – Wiktor Stribiżew Oct 17 '16 at 17:06
  • @WiktorStribiżew Thanks! Can you explain the changes you made a bit to help my own understanding? – Erik Oct 17 '16 at 18:35
  • I was putting kids to bed. So, does it work for you? The point is I used negated character classes to match chunks of text other than specific characters. – Wiktor Stribiżew Oct 17 '16 at 19:13
  • I think the best solution is to tokenize the text into sentences with [`nltk`](http://www.nltk.org/) and parse sentences (see [this thread](http://stackoverflow.com/questions/17879551/nltk-find-if-a-sentence-is-in-a-questioning-form)). Regex is not going to work in many cases, think about abbreviations. – Wiktor Stribiżew Oct 17 '16 at 19:19
  • Yes, your solution worked exactly. I understand that some nltk parsing would be the best way, but I was really just looking for a quick hack. Basically I wanted a quick way to examine some of the question syntax in my corpus. There are several variations (ANSF, National Science Foundation, etc), but I just wanted a quick look. – Erik Oct 17 '16 at 19:53

1 Answer

DISCLAIMER: This answer does not aim at a generic solution for splitting out interrogative sentences; rather, it shows how the strings supplied by the OP can be matched with regular expressions. The best solution is to tokenize the text into sentences with nltk and parse the sentences (see this thread).

The regex you might want to use for strings like the one you posted matches all chars that are not final punctuation, then the substring you want to appear inside the sentence, and then chars other than final punctuation again. To match any character except certain ones, use negated character classes.

\s*([^!.?]*?NSF[^!.?]*?[?])

See the regex demo.

Details:

  • \s* - 0+ whitespaces
  • ([^!.?]*?NSF[^!.?]*?[?]) - Group 1, capturing:
    • [^!.?]*? - 0+ chars other than ., ! and ?, as few as possible
    • NSF - the value you need to be present, a sequence of chars NSF
    • [^!.?]*? - same as above: 0+ chars other than ., ! and ?, as few as possible
    • [?] - a literal ? (can be replaced with \?)
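As a quick check, here is a minimal sketch that runs the pattern above against the sample string from the question, next to the original attempt for comparison. The variable names (`attempt`, `pattern`, `questions`) are only illustrative. The original attempt spans several sentences because `.` also matches sentence-final punctuation, so `.*?` and `.*` are free to run across sentence boundaries; the negated character classes cannot.

```python
import re

s = ("This is a string. Is this a question? This isn't a question about NSF. "
     "Is this one about NSF? This one is a question about NSF but is it longer?")

# Original attempt from the question (written as a raw string here):
# "." matches ".", "!" and "?" too, so the match crosses sentence boundaries.
attempt = re.findall(r'([\.\?].*?NSF.*\?)+?', s)

# Pattern from this answer: [^!.?] can never consume final punctuation,
# so each match stays inside a single sentence that ends in "?".
pattern = re.compile(r'\s*([^!.?]*?NSF[^!.?]*?[?])')
questions = pattern.findall(s)  # findall returns the contents of Group 1

print(questions)
# ['Is this one about NSF?', 'This one is a question about NSF but is it longer?']
```

Since the leading `\s*` sits outside the capturing group, the returned strings carry no leading whitespace. As noted in the disclaimer, this only holds for simple text like the sample; abbreviations containing dots would split sentences in the wrong places.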
Wiktor Stribiżew