This is a follow-up to the PHP sentence boundaries question on SO.

I'd like to know how to change the regex so that newlines are kept as well.

Sample code that splits some text into sentences, removes one sentence, then puts the text back together:

<?php
$re = '/# Split sentences on whitespace between them.
    (?<=                # Begin positive lookbehind.
      [.!?]             # Either an end of sentence punct,
    | [.!?][\'"]        # or end of sentence punct and quote.
    )                   # End positive lookbehind.
    (?<!                # Begin negative lookbehind.
      Mr\.              # Skip either "Mr."
    | Mrs\.             # or "Mrs.",
    | Ms\.              # or "Ms.",
    | Jr\.              # or "Jr.",
    | Dr\.              # or "Dr.",
    | Prof\.            # or "Prof.",
    | Sr\.              # or "Sr.",
    | T\.V\.A\.         # or "T.V.A.",
                        # or... (you get the idea).
    )                   # End negative lookbehind.
    [\s+|^$]            # Split on whitespace between sentences/empty lines.
    /ix';

$text = <<<EOL
This is paragraph one. This is sentence one. Sentence two!

This is paragraph two. This is sentence three. Sentence four!
EOL;

echo "\nBefore: \n" . $text . "\n";

$sentences = preg_split($re, $text, -1);

$sentences[1] = " "; // remove 'sentence one'

// put text back together
$text = implode( $sentences );

echo "\nAfter: \n" . $text . "\n";
?>

Running this, the output is

Before: 
This is paragraph one. This is sentence one. Sentence two!

This is paragraph two. This is sentence three. Sentence four!

After: 
This is paragraph one. Sentence two!
This is paragraph two. This is sentence three. Sentence four!

I'm trying to get the 'After' text to be the same as the 'Before' text, just with the one sentence removed.

After: 
This is paragraph one. Sentence two!

This is paragraph two. This is sentence three. Sentence four!

I'm hoping this can be done with a regex tweak, but what am I missing?

johnh10
  • Looks like there is an issue in this regex: `[\s+|^$]` really matches whitespace, `+`, `|`, `^` and `$` symbols. Replace that with `(?:\h+|^$)` and I guess that is it. – Wiktor Stribiżew Dec 02 '15 at 21:08
  • I think you can just remove the `+` after the `\s` or `\s{1}` if you really need it to match one, because the `\s+` is consuming the other whitespaces. Essentially you need `array( "stuf", "\n", "stuff");` but not sure without testing it, and it's too complicated to run in just my head. – ArtisticPhoenix Dec 02 '15 at 21:17

1 Answer

The end of the pattern should be replaced with:

  (?:\h+|^$)          # Split on whitespace between sentences\/empty lines.
/mix';

See IDEONE demo

Note that [\s+|^$] really matches whitespace (both horizontal and vertical, like newlines), +, |, ^ and $ symbols because it is a character class.

Instead of a character class, a group (better, non-capturing here) is necessary. Inside a group (marked with (...)) the | works as an alternation operator.
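The difference between the two constructs can be checked directly. This is a small sketch, not part of the original demo; the variable names are made up for illustration.

```php
<?php
// Character class: every character between [ and ] is a literal member of the set.
$class = '/[\s+|^$]/';
// Non-capturing group: | is alternation, + is a quantifier, ^ and $ are anchors.
$group = '/(?:\h+|^$)/m';

// The class matches a literal "+" or "|"; the group does not.
$classMatchesPlus = preg_match($class, '+');
$classMatchesPipe = preg_match($class, '|');
$groupMatchesPlus = preg_match($group, '+');
$groupMatchesPipe = preg_match($group, '|');

var_dump($classMatchesPlus, $classMatchesPipe, $groupMatchesPlus, $groupMatchesPipe);
```

The first two results should be 1 and the last two 0, which is why the original pattern would also split on stray `+`, `|`, `^` or `$` characters that follow sentence punctuation.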

Instead of \s, I suggest using \h that matches horizontal whitespace (no linebreaks) only.

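Without the /m (multiline) modifier, ^$ essentially matches only when the whole subject string is empty. With /m, ^ and $ also match at line boundaries, so ^$ can match an empty line. So, I have added the /m modifier to the options.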
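A quick way to see the effect of \h and /m in isolation is the sketch below. The sample string and variable names are assumptions made up for this illustration.

```php
<?php
// Illustration only: $text is a made-up sample with one empty line.
$text = "one.\n\ntwo.";

// Without /m, ^ and $ anchor to the subject as a whole, so ^$ finds nothing here.
$withoutM = preg_match('/^$/', $text);

// With /m, ^ and $ also match around newlines, so ^$ finds the empty line.
$withM = preg_match('/^$/m', $text, $m, PREG_OFFSET_CAPTURE);
$emptyLineOffset = $m[0][1];

// \s includes line breaks; \h is horizontal whitespace only.
$sMatchesNewline = preg_match('/\s/', "\n");
$hMatchesNewline = preg_match('/\h/', "\n");
$hMatchesTab     = preg_match('/\h/', "\t");

var_dump($withoutM, $withM, $emptyLineOffset, $sMatchesNewline, $hMatchesNewline, $hMatchesTab);
```

Because \h never consumes a line break, the newlines stay attached to the neighbouring pieces after a split instead of being thrown away as delimiters.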

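Also note that I had to escape the / inside the last comment: / is the pattern delimiter here, so an unescaped / ends the pattern even inside an x-mode comment, and there was a warning that the regex is incorrect. Alternatively, use different regex delimiters.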

Wiktor Stribiżew
  • Thanks. This almost works, with one quirk: the preg_split regex combines two of the sentences together. See http://ideone.com/AUImET Any idea? Also thanks for \h explanation I wasn't familiar with it. – johnh10 Dec 02 '15 at 21:31
  • What if you add a `PREG_SPLIT_DELIM_CAPTURE`, use a capturing group with `(\h+|^$)` and zero out the element at Index 2? See [this demo](http://ideone.com/ddq1hV). – Wiktor Stribiżew Dec 02 '15 at 21:41
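
The suggestion in the last comment can be sketched as follows. This is not the code from the linked demo. It assumes a shortened abbreviation list, uses ~ as the delimiter, and additionally blanks the delimiter that follows the removed sentence (index 3) so that no double space is left behind; that last step is my own assumption.

```php
<?php
// Sketch only: shortened abbreviation list, ~ as delimiter so that a / in a
// comment cannot end the pattern.
$re = '~
    (?<= [.!?] | [.!?][\'"] )    # after end-of-sentence punctuation
    (?<! Mr\. | Mrs\. | Dr\. )   # but not after these abbreviations
    ( \h+ | ^$ )                 # capture the delimiter so it can be restored
    ~mix';

$text = "This is paragraph one. This is sentence one. Sentence two!\n\n"
      . "This is paragraph two. This is sentence three. Sentence four!";

// Even indexes hold text, odd indexes hold the captured delimiters.
$parts = preg_split($re, $text, -1, PREG_SPLIT_DELIM_CAPTURE);

$parts[2] = ''; // remove "This is sentence one."
$parts[3] = ''; // assumption: also drop the space that followed it

$result = implode('', $parts);
echo $result, "\n";
```

In this sample the blank line is never a split point, because the lookbehind requires punctuation immediately before the delimiter. "Sentence two!" and "This is paragraph two." therefore stay in one element with their newlines intact, which is what keeps the paragraph break in the output.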