I need to split a paragraph into sentences, and I got a bit confused with the regex.

I have already looked at the question this one is marked as a duplicate of, but the issue here is different.

Here is an example of the string I need to split:

hello! how are you? how is life
live life, live free. "isnt it?"

Here is the code I tried:

$sentence_array = preg_split('/([.!?\r\n|\r|\n])+(?![^"]*")/', $paragraph, -1);

What I need is:

array (  
  [0] => "hello"  
  [1] => "how are you"  
  [2] => "how is life"  
  [3] => "live life, live free"  
  [4] => ""isnt it?""  
)

What I get is:

array(
  [0] => "hello! how are you? how is life live life, live free. "isnt it?""
)

When I do not have any quotes in the string, the split works as required.

Any help is appreciated. Thank you.

Prashanth Benny
  • Possible duplicate of [Explode a paragraph into sentences in PHP](https://stackoverflow.com/questions/10494176/explode-a-paragraph-into-sentences-in-php) – H2ONOCK Sep 28 '18 at 08:10
  • You might try something like `'~"[^"]*"(*SKIP)(*F)|\s*[.!?\r\n]\s*~'`, see [demo](https://regex101.com/r/gToxu2/1). – Wiktor Stribiżew Sep 28 '18 at 08:11
  • @H2ONOCK I had seen that one, but my issue here is specific and different. I have the split working fine without quotation marks. – Prashanth Benny Sep 28 '18 at 08:14
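
For readers who want to try the pattern suggested in the comments, here is a minimal sketch. It assumes the sample paragraph from the question; the variable names and the `PREG_SPLIT_NO_EMPTY` flag are additions for illustration and are not part of the original comment. `(*SKIP)(*F)` are PCRE backtracking control verbs: the first alternative consumes a quoted string and then forces a failure, so punctuation inside quotes is never considered as a split point.

```php
<?php
// Sample paragraph from the question.
$paragraph = "hello! how are you? how is life\nlive life, live free. \"isnt it?\"";

// Pattern from the comment: skip over "..." runs, otherwise split on
// a sentence marker or line break plus surrounding whitespace.
$pattern = '~"[^"]*"(*SKIP)(*F)|\s*[.!?\r\n]\s*~';

// PREG_SPLIT_NO_EMPTY is an assumption here, added to drop empty pieces.
$sentences = preg_split($pattern, $paragraph, -1, PREG_SPLIT_NO_EMPTY);

print_r($sentences);
```

With this input the quoted `"isnt it?"` stays intact as the last element, and the surrounding whitespace and line break are consumed by the delimiter.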

2 Answers

There are some problems with your regular expression, the main one being that it confuses group constructs with character classes. A pipe | inside a character class is a literal |; it has no special meaning there.
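
The difference can be checked directly. This is a small sketch with illustrative variable names; it uses the character class from the question and compares it with a group that uses real alternation.

```php
<?php
// Inside [...] the pipe is an ordinary character, so the class from the
// question also matches a literal "|".
$classMatchesPipe = preg_match('/[.!?\r\n|\r|\n]/', 'a|b');

// Alternation belongs in a group; this one does not match "|".
$groupMatchesPipe = preg_match('/(?:\r\n|\r|\n)/', 'a|b');

// The group still matches an actual line break.
$groupMatchesNewline = preg_match('/(?:\r\n|\r|\n)/', "a\nb");

var_dump($classMatchesPipe, $groupMatchesPipe, $groupMatchesNewline);
// int(1), int(0), int(1)
```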

What you need is this:

("[^"]*")|[!?.]+\s*|\R+

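This first tries to match a string enclosed in double quotation marks and captures it, quotes included, so that PREG_SPLIT_DELIM_CAPTURE can return it as an element of its own. Otherwise it matches a run of punctuation marks from the [!?.] set, plus any trailing whitespace, and splits there. Failing that, it matches a run of newline sequences of any kind (\R).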

PHP:

var_dump(preg_split('~("[^"]*")|[!?.]+\s*|\R+~', <<<STR
hello! how are you? how is life
live life, live free. "isnt it?"
STR
, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY));

Output:

array(5) {
  [0]=>
  string(5) "hello"
  [1]=>
  string(11) "how are you"
  [2]=>
  string(11) "how is life"
  [3]=>
  string(20) "live life, live free"
  [4]=>
  string(10) ""isnt it?""
}
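
To see what the two flags in the call above contribute, the same pattern can be run with and without `PREG_SPLIT_DELIM_CAPTURE`. This is only a sketch on the question's sample input; the variable names are illustrative.

```php
<?php
$paragraph = "hello! how are you? how is life\nlive life, live free. \"isnt it?\"";
$pattern = '~("[^"]*")|[!?.]+\s*|\R+~';

// Without PREG_SPLIT_DELIM_CAPTURE the quoted string is consumed as a
// plain delimiter and disappears from the result.
$withoutCapture = preg_split($pattern, $paragraph, -1, PREG_SPLIT_NO_EMPTY);

// With it, the text captured by group 1 is returned as an element.
$withCapture = preg_split($pattern, $paragraph, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);

var_dump(count($withoutCapture)); // int(4)
var_dump(count($withCapture));    // int(5)
var_dump(end($withCapture));      // string(10) ""isnt it?""
```

`PREG_SPLIT_NO_EMPTY` removes the empty strings that would otherwise appear between two adjacent delimiters, for example between the `. ` and the quoted string.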
revo

I view your problem of splitting on certain punctuation as already solved, except that it fails in the case of double quotes. We can phrase a solution as follows: split after such punctuation, or after this punctuation followed by a double quote.

The split should happen when the previous character is one of your markers and what follows is not a double quote, or when the previous two characters are a marker followed by a double quote. This implies splitting on the following pattern, which uses lookarounds:

(?<=[.!?\r\n])(?=[^"])|(?<=[.!?\r\n]")(?=.)

Code sample:

$input = "hello! how \"are\" \"you?\" how is life\nlive life, live free. \"isnt it?\"";
$sentence_array = preg_split('/(?<=[.!?\r\n])(?=[^"])|(?<=[.!?\r\n]\")(?=.)/', $input, -1);
print_r($sentence_array);

Array ( [0] => hello! [1] => how "are" "you?" [2] => how is life
    [3] => live life, live free. [4] => "isnt it?" )
Tim Biegeleisen
  • OP wants to match sentence end punctuation outside of `"`s. Lookarounds won't help, your solution will fail in many cases. – Wiktor Stribiżew Sep 28 '18 at 08:22
  • I have a small issue here: the `\r\n`, `\r` and `\n` are still in the strings. Everything else is great. Thank you, Tim. – Prashanth Benny Sep 28 '18 at 08:38
  • I don't have a fix for that. I can only suggest removing them afterwards. The lookaround trick I used does not consume anything, this is why it leaves the double quotes untouched. But this also means that newlines/carriage returns would also not be removed. – Tim Biegeleisen Sep 28 '18 at 08:41
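
One way to remove the leftover whitespace afterwards, as suggested in the last comment, is to trim each piece after splitting. This is a minimal sketch; the use of `array_map` with `trim` is an assumption about what "removing them afterwards" could look like, not something stated in the answer.

```php
<?php
$input = "hello! how \"are\" \"you?\" how is life\nlive life, live free. \"isnt it?\"";

// Lookaround-only pattern from the answer: it matches positions, not
// characters, so spaces and line breaks stay attached to the pieces.
$pieces = preg_split('/(?<=[.!?\r\n])(?=[^"])|(?<=[.!?\r\n]\")(?=.)/', $input, -1);

// Strip leading/trailing whitespace (including \r and \n) from each piece.
$sentence_array = array_map('trim', $pieces);

print_r($sentence_array);
```

The sentence-ending punctuation is still part of each element, since the pattern never consumes it.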