I'm adapting the top answer to "php sentence boundaries detection".

Could anyone give me a hand re-jigging the above regex to match the content between sentence boundaries instead of the boundaries themselves?

That regex was built for preg_split; I need one for preg_replace_callback.

Below is my attempt so far, but I can't get it to match the last sentence, as it relies on the lookbehinds to check for the boundary:

http://regex101.com/r/nH7mC5 - this contains example output minus the last sentence.

Sam Jarvis

1 Answer

I am the author of the cited sentence splitting answer. Here's a modified version that may suit your purposes:

An enhanced regex solution

Assuming you do care about handling abbreviations such as Mr. and Mrs., the following single-regex solution works pretty well:

<?php // test.php Rev:20140218_1500
$re = '/# Match sentence ending in .!? followed by optional quote.
    (                  # $1: Sentence.
      [^.!?]+          # One or more non-end-of-sentence chars.
      (?:              # Zero or more not-end-of-sentence dots.
        \.             # Allow dot mid-sentence, but only if:
        (?:            # Group allowable dot alternatives.
          (?=[^\s\'"]) # Dot is ok if followed by non-ws,
        | (?<=         # or if preceded by one of the following:
            Mr\.       # Either "Mr."
          | Mrs\.      # or "Mrs.",
          | Ms\.       # or "Ms.",
          | Jr\.       # or "Jr.",
          | Dr\.       # or "Dr.",
          | Prof\.     # or "Prof.",
          | Sr\.       # or "Sr.",
          | T\.V\.A\.  # or "T.V.A.",
                       # or... (you get the idea).
          )            # End positive lookbehind.
        )              # Group allowable dot alternatives.
        [^.!?]*        # Zero or more non-end-of-sentence chars.
      )*               # Zero or more not-end-of-sentence dots.
      (?:              # Sentence end alternatives.
        [.!?]          # Either end of sentence punctuation
        [\'"]?         # followed by optional quote,
      | $              # Or end of string with no punctuation.
      )                # Sentence end alternatives.
    )                  # End $1: Sentence.
    (?:\s+|$)          # Sentence ends with ws or EOS.
    /ix';

$text = 'This is sentence one. Sentence two! Sentence thr'.
        'ee? Sentence "four". Sentence "five"! Sentence "'.
        'six"? Sentence "seven." Sentence \'eight!\' Dr. '.
        'Jones said: "Mrs. Smith you have a lovely daught'.
        'er!" The T.V.A. is a big project! Last sentence '.
        'with no ending punctuation';

$sentences = array(); // Initialize array of sentences.

function _getSentencesCallback($matches) {
    global $sentences;
    $sentences[] = $matches[1];
    return '';
}
preg_replace_callback($re, '_getSentencesCallback', $text);
for ($i = 0; $i < count($sentences); ++$i) {
    printf("Sentence[%d] = [%s]\n", $i + 1, $sentences[$i]);
}
?>

Note that you can easily add or take away abbreviations from the expression. Given the following test paragraph:

This is sentence one. Sentence two! Sentence three? Sentence "four". Sentence "five"! Sentence "six"? Sentence "seven." Sentence 'eight!' Dr. Jones said: "Mrs. Smith you have a lovely daughter!" The T.V.A. is a big project! Last sentence with no ending punctuation

Here is the output from the script:

Sentence[1] = [This is sentence one.]
Sentence[2] = [Sentence two!]
Sentence[3] = [Sentence three?]
Sentence[4] = [Sentence "four".]
Sentence[5] = [Sentence "five"!]
Sentence[6] = [Sentence "six"?]
Sentence[7] = [Sentence "seven."]
Sentence[8] = [Sentence 'eight!']
Sentence[9] = [Dr. Jones said: "Mrs. Smith you have a lovely daughter!"]
Sentence[10] = [The T.V.A. is a big project!]
Sentence[11] = [Last sentence with no ending punctuation]
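
Since the abbreviation list is the part most likely to change, one option is to generate the lookbehind from an array rather than editing the pattern by hand. The following is only a sketch under stated assumptions: `buildSentenceRegex` is a hypothetical helper name, not part of the script above, and it emits a compact (non-`x`) form of the same pattern. Each abbreviation is prefixed with `\b` so that, under the `i` flag, a word such as "items." is not mistaken for "Ms.".

```php
<?php
// Hypothetical helper (an assumption, not part of the original script):
// builds the sentence-matching regex from a list of abbreviations.
// Each abbreviation must include its trailing dot, e.g. 'Mr.'.
function buildSentenceRegex(array $abbreviations): string
{
    // A dot is allowed mid-sentence when it is followed by a character
    // that is neither whitespace nor a quote ...
    $dotOk = '(?=[^\s\'"])';

    // ... or when it is the final dot of a known abbreviation.
    if ($abbreviations) {
        $alts = array_map(function (string $abbr): string {
            // \b keeps "items." from matching "Ms." under the i flag.
            return '\b' . preg_quote($abbr, '/');
        }, $abbreviations);
        $dotOk .= '|(?<=' . implode('|', $alts) . ')';
    }

    return '/([^.!?]+(?:\.(?:' . $dotOk . ')[^.!?]*)*(?:[.!?][\'"]?|$))(?:\s+|$)/i';
}

$re = buildSentenceRegex(['Mr.', 'Mrs.', 'Ms.', 'Dr.', 'e.g.']);
preg_match_all($re, 'Check the items. Ms. Lee agrees, e.g. on cost. Done', $m);
print_r($m[1]);
// Sentences found: "Check the items.", "Ms. Lee agrees, e.g. on cost.", "Done"
```

Because every alternative inside the lookbehind is a literal string, each one has a fixed length, which is what PCRE requires of lookbehind alternatives.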

Hope this helps and Happy Regexing!

Edit: 2014-02-19 08:00 Last sentence at end of string no longer requires punctuation.
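
The callback in the script only collects sentences and replaces each match with an empty string. If the goal is to transform each sentence in place, which is what preg_replace_callback is usually chosen for, a variant along these lines may be closer to what is needed. This is a minimal sketch: `mapSentences` is a hypothetical helper name, it uses a closure instead of a global, and it assumes the whitespace between sentences should be kept as-is.

```php
<?php
// Hypothetical helper (an assumption, not from the original answer): applies
// $transform to every sentence in $text and keeps the whitespace between them.
function mapSentences(string $text, callable $transform): string
{
    // Same pattern as in the script, with comments and free-spacing removed.
    $re = '/([^.!?]+(?:\.(?:(?=[^\s\'"])|(?<=Mr\.|Mrs\.|Ms\.|Jr\.|Dr\.|Prof\.|Sr\.|T\.V\.A\.))[^.!?]*)*(?:[.!?][\'"]?|$))(?:\s+|$)/i';

    $result = preg_replace_callback($re, function (array $m) use ($transform): string {
        // $m[0] is the sentence plus its trailing whitespace; $m[1] is the
        // sentence alone, so the remainder of $m[0] is the whitespace.
        $trailing = (string) substr($m[0], strlen($m[1]));
        return $transform($m[1]) . $trailing;
    }, $text);

    // preg_replace_callback returns null on a PCRE error; fall back to the input.
    return $result ?? $text;
}

echo mapSentences('Dr. Jones arrived. He said "hi." Last one', function (string $s): string {
    return '[' . $s . ']';
}), "\n";
// Prints: [Dr. Jones arrived.] [He said "hi."] [Last one]
```

Any callable that takes and returns a string can be passed as the second argument, for example 'strtoupper'.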

ridgerunner
  • This is just beautiful. It doesn't seem to detect the end of the string as the end of a sentence, though; it seems to require one of the usual sentence-end characters [.!?]. – Sam Jarvis Feb 19 '14 at 09:35
  • I've fixed the regex so that it now matches the last sentence when it has no end-of-sentence punctuation (and added a test case to the example test). – ridgerunner Feb 19 '14 at 15:18