
I'm trying to split a string of sentences on "." so that each sentence becomes an array element, like this:

$Text = "Hello, Mr. James. How are you today.";
$split = explode(".", $Text);

As you can see, $Text contains 2 sentences, so I should only get 2 elements in the array. The problem is that $Text can sometimes contain words like "Mr." or other abbreviations with a "." in the middle of a sentence. The sentence is then split at that point and the pieces end up in separate array elements, like this:

Array ( [0] => Hello, Mr [1] => James [2] => How are you today [3] => )
user3837019
  • Possible duplicate: [Split a text into sentences](http://stackoverflow.com/q/16377437/4577762) – FirstOne Apr 23 '17 at 18:50
  • @FirstOne Almost. The one you're linking to relies on whether there's a letter after a dot and a space, so "Mr. James" would register as 2 sentences with that. I think the only way to do this would be to check whether the first word after a dot is a valid English dictionary word or just a name. Even so, it would be very unreliable. – icecub Apr 23 '17 at 18:55
  • @icecub I didn't flag, just wrote _possible_ ;). Anyway, a Google search returned a bunch of results (such as [this](http://stackoverflow.com/questions/2158296/how-to-split-a-paragraph-into-sentences)), but I doubt the OP went through all of them. – FirstOne Apr 23 '17 at 19:02
  • @FirstOne Yes, I found Natural Language Processing as well. But it's quite a complicated subject, and I wonder how reliable it really is. – icecub Apr 23 '17 at 19:12
  • https://regex101.com/r/nG1gU7/27 – Peter Apr 23 '17 at 19:35
  • @Peter Close, but it fails on other honorifics like "Miss. James" – icecub Apr 23 '17 at 19:38
  • I think the easiest solution would simply be `/(?<! Mr| Mrs)\./i` and extending the list with all honorifics. – icecub Apr 23 '17 at 19:41
  • Yes, there isn't a perfect solution. There is the closest solution WITH tons of exceptions :) – Peter Apr 23 '17 at 19:43
  • @Peter Exactly. There's no perfect solution for this problem. Hence I'm not turning this into an answer, haha – icecub Apr 23 '17 at 19:45

1 Answer

You can avoid a lot of exception handling and general misery if you can ensure that every sentence in the text is followed by two consecutive spaces. This can be difficult with some digitized strings, because multiple spaces sometimes get condensed to a single space.

This is what I mean:

$Text = "Hello, Mr. James.  How are you today.";
$split = explode("  ", $Text);
var_export($split);
// array ( 0 => 'Hello, Mr. James.', 1 => 'How are you today.', )

Exploding on each pair of consecutive spaces gives a reliable result, provided the input is spaced consistently. If you want good output, you'll need good input.
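
If the spacing between sentences is not guaranteed to be exactly two spaces, a slightly more tolerant variant is to split on any run of two or more whitespace characters. This is only a sketch; the sample string and the variable names are made up for illustration, and it still cannot recover sentence breaks that have already been condensed to a single space.

```php
<?php
// Hypothetical input: three spaces after the first sentence, a blank line after the second
$text = "Hello, Mr. James.   How are you today.\n\nFine, thanks.";

// Split on runs of two or more whitespace characters; single spaces are left alone
$sentences = preg_split('~\s{2,}~', $text, -1, PREG_SPLIT_NO_EMPTY);

var_export($sentences);
```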


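If you want to blacklist a few predictable substrings that should not be used to split the string, you can use (*SKIP)(*FAIL) for that. A listed abbreviation followed by punctuation is matched and then discarded, so the split only happens on whitespace that follows other sentence-ending punctuation.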

Code:

$text = "Hello, Mr. James. How are you today.";

var_export(
    preg_split('~(?:Mrs?|Miss|Ms|Prof|Rev|Col|Dr)[.?!:](*SKIP)(*F)|[.?!:]+\K\s+~', $text, 0, PREG_SPLIT_NO_EMPTY)
);

Output:

array (
  0 => 'Hello, Mr. James.',
  1 => 'How are you today.',
)
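
The same idea can be wrapped in a small helper so the skip list lives in an array rather than in the pattern itself. This is a sketch, not part of the original answer: the function name `splitSentences` and the `$abbreviations` list are hypothetical, it assumes a non-empty list and PHP 7.4+ (for the arrow function), it adds a `\b` so an abbreviation is not matched at the end of a longer word, and it splits only after `.`, `?` and `!`.

```php
<?php
// Hypothetical helper: builds the (*SKIP)(*F) pattern from a list of abbreviations.
function splitSentences(string $text, array $abbreviations): array
{
    // Escape each abbreviation so it is matched literally inside the pattern
    $skip = implode('|', array_map(
        fn(string $a): string => preg_quote($a, '~'),
        $abbreviations
    ));

    // First branch: an abbreviation plus its dot is consumed and discarded.
    // Second branch: split on whitespace that follows sentence-ending punctuation;
    // \K keeps the punctuation attached to the sentence.
    $pattern = '~\b(?:' . $skip . ')\.(*SKIP)(*F)|[.?!]+\K\s+~';

    return preg_split($pattern, $text, -1, PREG_SPLIT_NO_EMPTY);
}

// Illustrative list only; extend it to match your own data
$abbreviations = ['Mr', 'Mrs', 'Ms', 'Miss', 'Dr', 'Prof'];

var_export(
    splitSentences('Hello, Dr. Smith and Mrs. Jones. How are you today? Fine.', $abbreviations)
);
```

As the comments on the question point out, any list like this will miss some cases, so treat it as a starting point to extend.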
mickmackusa