6

Possible Duplicate:
PHP - How to split a paragraph into sentences.

I have a block of text that I would like to separate into sentences. What would be the best way of doing this? I thought of looking for '.', '!' and '?' characters, but I realized there are problems with this, such as when people use acronyms or end a sentence with something like !?. What would be the best way to handle these cases? I figured there would be some regex that could handle this, but I'm open to a non-regex solution if that fits the problem better.

GSto

3 Answers

2

Regex isn't the best solution for this problem. You'd be better served by writing a small parser, something in which you can easily create logic blocks that distinguish one kind of token from another. You'll need to come up with a set of rules for breaking the text into the chunks you'd like to see. Consider:

"Are you sure?" he asked.

A regex that splits on punctuation would cut that in two at the question mark. A parser, however, would see something like

<start quote><capitalization>are you sure<question><end quote>he asked<period>

and a few simple rules could then conclude "that's one sentence."
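A minimal sketch of that idea, written in Python purely for illustration. The token classes, the function name `split_with_tokens`, and the single rule used here (a terminator, plus any closing quote right after it, ends a sentence only when the next word is capitalized or the text ends) are assumptions of this sketch, not something the answer specifies.

```python
import re

# Three token classes: a double quote, a run of terminators, or a word.
TOKEN = re.compile(r'"|[.!?]+|[^\s".!?]+')

def split_with_tokens(text):
    sentences = []
    start = 0          # index where the current sentence begins
    in_quote = False   # are we between an opening and a closing quote?
    candidate = None   # index of a possible sentence end, not yet confirmed
    for m in TOKEN.finditer(text):
        tok = m.group()
        if tok == '"':
            in_quote = not in_quote
            # A closing quote right after a terminator belongs to that sentence.
            if not in_quote and candidate is not None:
                candidate = m.end()
        elif tok[0] in ".!?":
            candidate = m.end()
        else:
            # A word: confirm the pending end only if the word is capitalized.
            if candidate is not None and tok[0].isupper():
                sentences.append(text[start:candidate].strip())
                start = candidate
            candidate = None
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences

print(split_with_tokens('He left. "Are you sure?" he asked.'))
```

The capitalization rule keeps the quoted question attached to "he asked", but it would still split after an abbreviation followed by a proper noun, so real text needs more rules than this.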

wheaties
  • 1
    Or, annoyingly, you could get things like `"Are you sure"? he asked.` which are semantically correct but look oh so wrong. Also, nouns which contain punctuation are also bad: `Which? recommend buying....` – Callum Rogers Sep 09 '10 at 16:28
  • Actually the ? should be inside the quotes. – CrayonViolent Sep 09 '10 at 16:30
1

Unfortunately there is no perfect solution for this, for the very reasons you stated. If it is content that you control, or if you can force a specified delimiter after every sentence, that would be ideal. Beyond that, all you can really do is look for (\.|!|\?)+ (note that the ? must be escaped) and perhaps require a \s+ after it, since most people put one or two spaces between the end of one sentence and the start of the next.
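As a rough illustration of that pattern, here is a sketch in Python. The function name `split_on_punctuation` is made up for this example, and the lookbehind is one possible way to keep the punctuation attached to its sentence rather than discarding it.

```python
import re

# Split on whitespace that directly follows ., ! or ?
# The lookbehind leaves the punctuation on the preceding sentence.
SENTENCE_GAP = re.compile(r'(?<=[.!?])\s+')

def split_on_punctuation(text):
    return [s for s in SENTENCE_GAP.split(text.strip()) if s]

print(split_on_punctuation("Really!? Yes. Dr. Smith agrees."))
# ['Really!?', 'Yes.', 'Dr.', 'Smith agrees.']
```

The output shows both sides: "!?" is handled as one ending, while "Dr." is wrongly treated as a sentence of its own.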

CrayonViolent
0

I think the biggest problem is the possible presence of abbreviations. This is why you must write something like Prof.&nbsp;Knuth in a JavaDoc summary sentence, so that the javadoc generator doesn't think the first sentence ends after "Prof.". I don't know how anyone can handle this reliably. The only approximate solution I can imagine is to use an abbreviation dictionary.
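A small sketch of the dictionary approach, in Python for illustration only. The abbreviation list is a tiny hypothetical sample and `split_with_abbreviations` is an invented name; a usable list would have to be much larger and language-specific.

```python
# Hypothetical sample dictionary; extend it for real use.
ABBREVIATIONS = {"prof.", "dr.", "mr.", "mrs.", "e.g.", "i.e.", "etc."}

def split_with_abbreviations(text, abbreviations=ABBREVIATIONS):
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        # Ignore trailing quotes and brackets when looking at the ending.
        core = token.rstrip("\"')]")
        ends_sentence = core.endswith((".", "!", "?"))
        if ends_sentence and core.lower() not in abbreviations:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences

print(split_with_abbreviations("Prof. Knuth wrote it. Did you read it!? Yes."))
# ['Prof. Knuth wrote it.', 'Did you read it!?', 'Yes.']
```

This remains approximate: a sentence that genuinely ends with a listed abbreviation such as "etc." would be merged with the one that follows.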

splash