2

I've been trying to use:

$string="The Dr. is here!!! I am glad I'm in the U.S.A. for the Dr. quality is great!!!!!!";
preg_match_all('~.*?[?.!]~s',$string,$sentences);
print_r($sentences);

But it doesn't work on abbreviations such as Dr. and U.S.A., because it treats every period as the end of a sentence.

Does anyone have any better suggestions?

hippietrail
Scott Tyler
  • I don't know regex well enough, but I was wondering if there's a way to say that the sentence before the last has to be at least 5 characters long, or something like that... – Scott Tyler Jan 28 '10 at 22:04
  • Something like this: (\w+'?\s?)+\. – Adam Taylor Jan 28 '10 at 22:08
  • Well, to handle the case you gave, you'll want a regex that checks for a space followed by an uppercase letter before it splits. I'm not very familiar with regexes, but you could probably do this. I think the rules will soon get more complicated, though, and to do it properly you'd probably combine a simple regex with a little state machine. – Noon Silk Jan 28 '10 at 22:14
  • The space followed by an uppercase letter won't necessarily work. Imagine working with this sentence: `Hello, Dr. Smith is ready for you. Please go to the E.R. where he is waiting.` – Aaron Jan 28 '10 at 22:26
  • Aaron: That's why I said you'd need to combine it with a state machine. – Noon Silk Jan 28 '10 at 22:28
  • @Aaron and silky - What's a state machine? Sorry, I've not been formally trained, so I might know what it is but not what it's called. I've looked on the wiki, but it didn't really help... – Scott Tyler Jan 28 '10 at 22:33
  • Scott: It's really just code that decides on the course of action based on the current 'state' of some variables. So you'll be at the '.' and you'll have a 'previousWord' of 'Dr'. You can then look that up in a list of, say, "legal words ending in . but not ending a sentence" (or some more complicated model) and decide whether to break into a new sentence at that point. – Noon Silk Jan 28 '10 at 22:51
  • http://stackoverflow.com/questions/5032210/php-sentence-boundaries-detection – giorgio79 Oct 07 '11 at 16:47
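
The state-machine idea from the comments above can be sketched roughly as follows. This is a minimal illustration, not a complete solution: the function name `splitSentences` and the abbreviation list are assumptions made up for this example, and a real list would need to be much longer.

```php
<?php
// Hypothetical helper (not from the original thread): split text into
// sentences, refusing to break after words listed as abbreviations.
function splitSentences(string $text, array $abbreviations): array
{
    // Lower-cased lookup table of words that end in "." without ending a sentence.
    $known = array_flip(array_map('strtolower', $abbreviations));

    // Tokenize on whitespace; each token keeps its trailing punctuation.
    $tokens = preg_split('/\s+/', trim($text), -1, PREG_SPLIT_NO_EMPTY);

    $sentences = [];
    $current = [];
    foreach ($tokens as $token) {
        $current[] = $token;

        // State check 1: does this token end in ., ! or ?
        $endsWithTerminator = preg_match('/[.!?]+$/', $token) === 1;
        // State check 2: is the token one of the listed abbreviations?
        $isAbbreviation = isset($known[strtolower($token)]);

        if ($endsWithTerminator && !$isAbbreviation) {
            // Close the sentence and start collecting the next one.
            $sentences[] = implode(' ', $current);
            $current = [];
        }
    }

    // Keep any trailing text that had no terminator.
    if ($current !== []) {
        $sentences[] = implode(' ', $current);
    }

    return $sentences;
}

$string = "The Dr. is here!!! I am glad I'm in the U.S.A. for the Dr. quality is great!!!!!!";
print_r(splitSentences($string, ['Dr.', 'Mr.', 'Mrs.', 'U.S.A.']));
```

On the string from the question this yields two sentences, "The Dr. is here!!!" and the rest. Note that the sketch never breaks after a listed abbreviation, so a sentence that genuinely ends in one (such as "Please go to the E.R.") would be merged with the next sentence, and runs of whitespace are collapsed to single spaces.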

3 Answers

11

There isn't any simple solution for that. You need to do some natural language processing (NLP) in your application to recognize each sentence. There is a tool called OpenNLP, a Java-based NLP toolkit, and there is the Stanford NLP parser, which can be used from Ruby through a wrapper. You may be able to find something similar for PHP.

Here is a set of classes I found for natural language processing in PHP.

Michel Gokan Khan
  • +1 - and indeed, even a solution that uses NLP is likely to fail when faced with sufficiently informal (e.g. sloppy) writing. If people don't follow the basic rules of punctuation, you are stuffed. – Stephen C Jan 28 '10 at 22:23
  • Seems like the files for that project are no longer online – Chris Harrison Apr 03 '12 at 09:12
1

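Hmm, maybe try something like $sentences = preg_split('/(?<=[?.!])\s+/', $string); which splits on the whitespace that follows sentence-ending punctuation. It no longer splits inside U.S.A., but it will still split after Dr.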

prodigitalson
0

This is almost impossible with punctuation alone. As your example shows, the same characters that end a sentence also appear inside abbreviations such as Dr. and U.S.A., so a period by itself doesn't tell you where a sentence starts or ends.

You have to look at the characters that follow the punctuation you mention to decide whether a new sentence starts there.
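
As a rough sketch of that idea, the split below only happens when the punctuation is followed by whitespace and then an uppercase letter. The function name `splitBeforeCapital` is hypothetical, and the uppercase-letter rule is just one possible heuristic, limited here to ASCII letters.

```php
<?php
// Hypothetical helper: split only where ., ! or ? is followed by
// whitespace and then an uppercase ASCII letter.
function splitBeforeCapital(string $text): array
{
    // (?<=[.!?]) looks back at the punctuation without consuming it;
    // (?=[A-Z]) looks ahead at the next character without consuming it.
    return preg_split('/(?<=[.!?])\s+(?=[A-Z])/', $text);
}

$string = "The Dr. is here!!! I am glad I'm in the U.S.A. for the Dr. quality is great!!!!!!";
print_r(splitBeforeCapital($string));
```

For the string in the question this gives two sentences, because "Dr." and "U.S.A." are both followed by lowercase words. As Aaron's comment on the question points out, the heuristic breaks down when an abbreviation is followed by a capitalized word: "Hello, Dr. Smith is ready for you." is split after "Dr."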

Andreas