Split string into sentences using regex

Question

I have random text stored in $sentences. Using regex, I want to split the text into sentences, see:

function splitSentences($text) {
    $re = '/                # Split sentences on whitespace between them.
        (?<=                # Begin positive lookbehind.
          [.!?]             # Either an end of sentence punct,
        | [.!?][\'"]        # or end of sentence punct and quote.
        )                   # End positive lookbehind.
        (?<!                # Begin negative lookbehind.
          Mr\.              # Skip either "Mr."
        | Mrs\.             # or "Mrs.",
        | T\.V\.A\.         # or "T.V.A.",
                            # or... (you get the idea).
        )                   # End negative lookbehind.
        \s+                 # Split on whitespace between sentences.
        /ix';

    $sentences = preg_split($re, $text, -1, PREG_SPLIT_NO_EMPTY);
    return $sentences;
}

$sentences = splitSentences($sentences);

print_r($sentences);

It works fine.

However, it doesn't split into sentences if there are unicode characters:

$sentences = 'Entertainment media properties.Â Fairy Tail and Tokyo Ghoul.';

Or this scenario:

$sentences = "Entertainment media properties.&Acirc;&nbsp; Fairy Tail and Tokyo Ghoul.";

What can I do to make it work when unicode characters exist in the text?

Here is an ideone for testing.

Bounty info

I am looking for a complete solution to this. Before posting an answer, please read the comment thread I had with WiktorStribiżew for more relevant info on this issue.

I will bounty this question with 50 points once it is eligible. — Henrik Petterson, Jan 19 '16 at 16:22
@WiktorStribiżew Can you please demonstrate with an answer? — Henrik Petterson, Jan 19 '16 at 16:23
@WiktorStribiżew Yes, if I remove the unicode character, it works fine, see example: http://ideone.com/ZQhPSV — Henrik Petterson, Jan 19 '16 at 16:28
Aha, I tried with `$sentences = 'Entertainment media properties.A Fairy Tail and Tokyo Ghoul.';` See [this demo](http://ideone.com/x3P3xo). I guess the problem is with whitespace that may be missing. Try [this code](http://ideone.com/iAJEN2). — Wiktor Stribiżew, Jan 19 '16 at 16:29
@WiktorStribiżew Interesting. Can you please outline what changes you performed and how it may affect the text? The content in $sentences is pulled from external sites, so I can't control the text. Therefore, I need this to be as bulletproof as possible. — Henrik Petterson, Jan 19 '16 at 16:34
I just made the `\s+` optional with `\s*`. I see Henry is quick to read others' comments :) — Wiktor Stribiżew, Jan 19 '16 at 16:35
@WiktorStribiżew What's the overall change with the \s* approach and are there cases where this will break sentences incorrectly compared to \s+...? — Henrik Petterson, Jan 19 '16 at 16:40
That means you cannot just use the criteria you chose. You will have to add more blacklisted patterns, like "not before and after a digit". See [this regex](https://regex101.com/r/lG1rK5/2). Without a profound testing corpus, this task is very difficult. — Wiktor Stribiżew, Jan 19 '16 at 16:42
@WiktorStribiżew Got it. Thank you very much for the info. I will leave this question open and will bounty it with 50 points (when eligible) for a "bulletproof" solution, if such can be put to code. — Henrik Petterson, Jan 19 '16 at 16:45
You should take a look at this module: https://packagist.org/packages/nlp-tools/nlp-tools — Casimir et Hippolyte, Jan 19 '16 at 17:06
I think you will not get a precise bulletproof generic solution based just one regex. If a regex solution is posted it will have assumptions. I doubt you can account for all abbreviations and other special cases of using final punctuation. Too broad. — Wiktor Stribiżew, Jan 19 '16 at 17:09
@WiktorStribiżew I've opened a 200 point bounty given the effort it takes to come up for a bulletproof and complete solution. May that be in one regex or several. Feel free to give this a shot since you appear to be the regex guru here ;-) — Henrik Petterson, Jan 30 '16 at 13:39
As said, you cannot parse language with regular expressions. — , Feb 01 '16 at 20:36
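The trade-off discussed in the comments above (making `\s+` optional with `\s*`, then blacklisting positions such as a period between two digits) can be illustrated with a minimal sketch. The function name `split_glued_sentences` is hypothetical, and the pattern is not the one from the linked demos. It ignores abbreviations such as "Mr." entirely and only shows how the digit guard keeps "3.5" intact.

```php
<?php
// Hypothetical helper: a sketch of the "\s* plus digit guard" idea, not a complete splitter.
function split_glued_sentences(string $text): array
{
    $re = '/
        (?<=[.!?])            # right after end-of-sentence punctuation
        (?!(?<=\d\.)\d)       # but not inside a number such as 3.5
        \s*                   # zero or more whitespace characters
        /ux';

    // A zero-width match still splits the string at that position.
    return preg_split($re, $text, -1, PREG_SPLIT_NO_EMPTY);
}

print_r(split_glued_sentences('Entertainment media properties.Fairy Tail 3.5 and Tokyo Ghoul.'));
```

Because the whitespace is optional, every unguarded `.`, `!` or `?` becomes a split point, so each new false positive needs its own guard. That is the maintenance cost the comments warn about.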

ndnenkov · Accepted Answer · 2016-04-11T17:39:47.417

As should be expected, any sort of natural language processing is not a trivial task. The reason is that natural languages are evolutionary systems: no single person sat down and decided which ideas were good and which were not, so every rule has 20-40% exceptions. With that said, the complexity of a single regex that could do your bidding would be off the charts. Still, the following solution relies mainly on regexes.

The idea is to gradually go over the text.
At any given time, the current chunk of the text will be contained in two different parts. One, which is the candidate for a substring before a sentence boundary and another - after.
The first 10 regex pairs detect positions which look like sentence boundaries, but actually aren't. In that case, before and after are advanced without registering a new sentence.
If none of these pairs matches, matching will be attempted with the last 3 pairs, possibly detecting a boundary.

As for where these regexes came from: I translated this Ruby library, which is generated based on this paper. If you truly want to understand them, there is no alternative but to read the paper.

As far as accuracy goes - I encourage you to test it with different texts. After some experimentation, I was very pleasantly surprised.

In terms of performance, the regexes should be highly performant: all of them have either a \A or a \Z anchor, there are almost no repetition quantifiers, and in the places where there are, there can't be any backtracking. Still, regexes are regexes. You will have to do some benchmarking if you plan to use this in tight loops on huge chunks of text.

Mandatory disclaimer: excuse my rusty PHP skills. The following code might not be the most idiomatic PHP ever, but it should still be clear enough to get the point across.

function sentence_split($text) {
    $before_regexes = array('/(?:(?:[\'\"„][\.!?…][\'\"”]\s)|(?:[^\.]\s[A-Z]\.\s)|(?:\b(?:St|Gen|Hon|Prof|Dr|Mr|Ms|Mrs|[JS]r|Col|Maj|Brig|Sgt|Capt|Cmnd|Sen|Rev|Rep|Revd)\.\s)|(?:\b(?:St|Gen|Hon|Prof|Dr|Mr|Ms|Mrs|[JS]r|Col|Maj|Brig|Sgt|Capt|Cmnd|Sen|Rev|Rep|Revd)\.\s[A-Z]\.\s)|(?:\bApr\.\s)|(?:\bAug\.\s)|(?:\bBros\.\s)|(?:\bCo\.\s)|(?:\bCorp\.\s)|(?:\bDec\.\s)|(?:\bDist\.\s)|(?:\bFeb\.\s)|(?:\bInc\.\s)|(?:\bJan\.\s)|(?:\bJul\.\s)|(?:\bJun\.\s)|(?:\bMar\.\s)|(?:\bNov\.\s)|(?:\bOct\.\s)|(?:\bPh\.?D\.\s)|(?:\bSept?\.\s)|(?:\b\p{Lu}\.\p{Lu}\.\s)|(?:\b\p{Lu}\.\s\p{Lu}\.\s)|(?:\bcf\.\s)|(?:\be\.g\.\s)|(?:\besp\.\s)|(?:\bet\b\s\bal\.\s)|(?:\bvs\.\s)|(?:\p{Ps}[!?]+\p{Pe} ))\Z/su',
        '/(?:(?:[\.\s]\p{L}{1,2}\.\s))\Z/su',
        '/(?:(?:[\[\(]*\.\.\.[\]\)]* ))\Z/su',
        '/(?:(?:\b(?:pp|[Vv]iz|i\.?\s*e|[Vvol]|[Rr]col|maj|Lt|[Ff]ig|[Ff]igs|[Vv]iz|[Vv]ols|[Aa]pprox|[Ii]ncl|Pres|[Dd]ept|min|max|[Gg]ovt|lb|ft|c\.?\s*f|vs)\.\s))\Z/su',
        '/(?:(?:\b[Ee]tc\.\s))\Z/su',
        '/(?:(?:[\.!?…]+\p{Pe} )|(?:[\[\(]*…[\]\)]* ))\Z/su',
        '/(?:(?:\b\p{L}\.))\Z/su',
        '/(?:(?:\b\p{L}\.\s))\Z/su',
        '/(?:(?:\b[Ff]igs?\.\s)|(?:\b[nN]o\.\s))\Z/su',
        '/(?:(?:[\"”\']\s*))\Z/su',
        '/(?:(?:[\.!?…][\x{00BB}\x{2019}\x{201D}\x{203A}\"\'\p{Pe}\x{0002}]*\s)|(?:\r?\n))\Z/su',
        '/(?:(?:[\.!?…][\'\"\x{00BB}\x{2019}\x{201D}\x{203A}\p{Pe}\x{0002}]*))\Z/su',
        '/(?:(?:\s\p{L}[\.!?…]\s))\Z/su');
    $after_regexes = array('/\A(?:)/su',
        '/\A(?:[\p{N}\p{Ll}])/su',
        '/\A(?:[^\p{Lu}])/su',
        '/\A(?:[^\p{Lu}]|I)/su',
        '/\A(?:[^\p{Lu}])/su',
        '/\A(?:\p{Ll})/su',
        '/\A(?:\p{L}\.)/su',
        '/\A(?:\p{L}\.\s)/su',
        '/\A(?:\p{N})/su',
        '/\A(?:\s*\p{Ll})/su',
        '/\A(?:)/su',
        '/\A(?:\p{Lu}[^\p{Lu}])/su',
        '/\A(?:\p{Lu}\p{Ll})/su');
    $is_sentence_boundary = array(false, false, false, false, false, false, false, false, false, false, true, true, true);
    $count = 13;

    $sentences = array();
    $sentence = '';
    $before = '';
    $after = substr($text, 0, 10);
    $text = substr($text, 10);

    while($text != '') {
        for($i = 0; $i < $count; $i++) {
            if(preg_match($before_regexes[$i], $before) && preg_match($after_regexes[$i], $after)) {
                if($is_sentence_boundary[$i]) {
                    array_push($sentences, $sentence);
                    $sentence = '';
                }
                break;
            }
        }

        $first_from_text = $text[0];
        $text = substr($text, 1);
        $first_from_after = $after[0];
        $after = substr($after, 1);
        $before .= $first_from_after;
        $sentence .= $first_from_after;
        $after .= $first_from_text;
    }

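    // Flush the remainder. Testing the concatenation (rather than both parts)
    // also returns inputs of 10 bytes or fewer, which never enter the loop.
    if($sentence.$after != '') {
        array_push($sentences, $sentence.$after);
    }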

    return $sentences;
}

$text = "Mr. Entertainment media properties.Â Fairy Tail 3.5 and Tokyo Ghoul.";
print_r(sentence_split($text));

This is an outstanding answer, thank you very much for posting it. It works in the scenario I asked about in my question, but when I added it to my script, it still didn't work. I investigated further and it appears that I hadn't checked the source code of the page properly. See this example: http://ideone.com/epdpxO It doesn't work with **"Entertainment media properties.&Acirc;&nbsp; Fairy Tail and Tokyo Ghoul."** -- do you have the regex skills to adjust the function to detect this type of content? — Henrik Petterson, Jan 31 '16 at 14:01
Use [`html_entity_decode`](http://php.net/manual/fr/function.html-entity-decode.php) beforehand. — Lucas Trzesniewski, Jan 31 '16 at 16:15
@HenrikPetterson, while duplicating the regexes to work for the html encoded versions of the characters might be doable, it is way better to just decode the string as Lucas suggested. — ndnenkov, Jan 31 '16 at 18:37
`BEFORE_RE = /(?:#{RULES.map{|s,e,v| "(#{s})"}.join("|")})\Z/m` has an `/m` modifier that redefines the `.` matching behavior in Ruby. In PHP, it is equivalent to `/s`. However, I do not see any pattern here that uses a "special" dot inside, so the `m` modifier can be just removed from the regexps. Also, `\A` and `\Z` can be replaced with `^` and `$` for brevity (in PHP, `^` and `$` only match the start/end of line if the `/m` modifier is used - if you replace `\A` and `\Z` with `^` and `$`, you will *have to* remove `m` modifier). — Wiktor Stribiżew, Apr 11 '16 at 06:47
@WiktorStribiżew, so I should replace `/m`s with `/s`es then? — ndnenkov, Apr 11 '16 at 17:06
@WiktorStribiżew, I prefer to leave the directness of the translation apparent. There are a ton of things, it will only make it harder to trace possible mistakes. I would still want to fix it if the translation was incorrect though. — ndnenkov, Apr 11 '16 at 17:23
Then you need to replace `/m`with `/s`. Or let those who use it use correct modifier. Just `/\A(?:)/mu` does not make much sense, it can be written as `/^/u` — Wiktor Stribiżew, Apr 11 '16 at 17:29
@WiktorStribiżew, I know `\A(?:)` is pretty much equivalent to nothing. The reason I left it that way is because I grep-translated the whole thing. I prefer to leave it that way so it is more obvious what steps I took to do that. I will fix the regex modificators though. Thanks for that. — ndnenkov, Apr 11 '16 at 17:36
Excellent answer and extremely thorough, but also slow with large blocks of text. — Bangkokian, Dec 14 '19 at 12:04
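The `html_entity_decode` suggestion from the comments above can be sketched as a small pre-processing step. The helper name `normalize_scraped_text` and the two clean-up steps after decoding are assumptions made for illustration. In particular, treating "Â" followed by a no-break space as a single space assumes that pair only ever comes from a doubly encoded U+00A0.

```php
<?php
// Hypothetical helper; the name and the clean-up steps after decoding are assumptions.
function normalize_scraped_text(string $html): string
{
    // Turn entities such as &Acirc; and &nbsp; into real UTF-8 characters.
    $text = html_entity_decode($html, ENT_QUOTES | ENT_HTML5, 'UTF-8');

    // "Â" + no-break space is what a doubly encoded U+00A0 decodes to; treat the pair as one space.
    $text = str_replace("\u{00C2}\u{00A0}", ' ', $text);

    // Collapse whitespace runs (including any remaining no-break spaces) into one space.
    // preg_replace returns null on invalid UTF-8, so fall back to the uncollapsed text.
    return preg_replace('/\s+/u', ' ', $text) ?? $text;
}

$raw = 'Entertainment media properties.&Acirc;&nbsp; Fairy Tail and Tokyo Ghoul.';
echo normalize_scraped_text($raw), "\n";
```

The normalized string can then be passed to `sentence_split()` unchanged, which avoids duplicating every regex for entity-encoded input.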

score 6 · Answer 2 · answered Jan 19 '16 at 16:53

Â is what it looks like when you print a UTF-8 character U+00A0 Non-Breaking Space to a page/console being interpreted as Latin-1. So I think you have a non-breaking space between the sentences, not a normal space.

\s can match a non-breaking space too, but you will need to use the /u modifier to tell preg you are sending it a UTF-8-encoded string. Otherwise it, like your print command, will guess Latin-1 and see it as the two characters Â .

bobince


Do you mind providing me with an example code of how the /u modifier would work as I can't seem to make it work as you suggest. Here is a http://ideone.com/ZQhPSV for reference. Also, please see the conversation I had with WiktorStribiżew above. – Henrik Petterson Jan 19 '16 at 21:27
Replace `/ix` with `/uix`. – bobince Jan 20 '16 at 20:30
I tried it but it didn't split the sentences. Please see: http://ideone.com/m164fp – Henrik Petterson Jan 21 '16 at 10:09
ideone's input is already UTF-8 encoded, so by putting `Â ` you have double-UTF-8-encoded your input string. Try it against the real input string. – bobince Jan 21 '16 at 19:47
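The `/u` advice in this answer can be checked with a short sketch. It builds the input with a real U+00A0 via the `"\u{00A0}"` escape (PHP 7 or later) instead of pasting "Â", which sidesteps the double-encoding problem mentioned in the last comment. The pattern is the question's regex reduced to its core, so the abbreviation handling is left out on purpose.

```php
<?php
// U+00A0 (no-break space) is the two bytes C2 A0 in UTF-8.
// Read as Latin-1, those bytes display as "Â" followed by a no-break space.
$nbsp = "\u{00A0}";
echo bin2hex($nbsp), "\n"; // c2a0

$text = "Entertainment media properties.{$nbsp}Fairy Tail and Tokyo Ghoul.";

// With the u modifier the subject is treated as UTF-8, so \s can match U+00A0.
$re = '/(?<=[.!?])\s+/u';
$sentences = preg_split($re, $text, -1, PREG_SPLIT_NO_EMPTY);

print_r($sentences);
```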

score 3 · Answer 3 · answered Jan 31 '16 at 02:17

If spaces are unreliable, then you could match on a . followed by any number of spaces, followed by a capital letter.

You can match any capital UTF-8 letter using the Unicode character property \p{Lu}.

You only need to exclude abbreviations that tend to be followed by proper names (person names, company names, etc.), since those start with a capital letter.

function splitSentences($text) {
    $re = '/                # Split sentences ending with a dot
        .+?                 # Match everything before, until we find
        (
          $ |               # the end of the string, or
          \.                # a dot
          (?<!              #  Begin negative lookbehind.
            Mr\.            #   Skip either "Mr."
          | Mrs\.           #   or "Mrs.",
                            #   or... (you get the idea).
          )                 #   End negative lookbehind.
          "?                #   Optionally match a quote
          \s*               #   Any number of whitespaces
          (?=               #  Begin positive lookahead
            \p{Lu} |        #   an upper case letter, or
            "               #   a quote
          )
        )
        /iux';

    if (!preg_match_all($re, $text, $matches, PREG_PATTERN_ORDER)) { 
        return [];
    }

    $sentences = array_map('trim', $matches[0]);

    return $sentences;
}

$text = "Mr. Entertainment media properties.Â Fairy Tail 3.5 and Tokyo Ghoul.";
$sentences = splitSentences($text);

print_r($sentences);

Note: This answer might not be accurate enough for your situation. I'm unable to judge that. It does address the problem as described above and is easily understandable.

score 3 · Answer 4 · answered Feb 03 '16 at 10:38

Henrik Petterson, please read this completely, because I need to repeat a few things that have already been said above.

As many people above have mentioned, adding the /u modifier makes the regex work on Unicode characters. That is true, and it works perfectly in the example below:

http://ideone.com/750lMn

<?php


    function splitSentences($text) {
        $re = '/# Split sentences on whitespace between them.
            (?<=                # Begin positive lookbehind.
              [.!?]             # Either an end of sentence punct,
            | [.!?][\'"]        # or end of sentence punct and quote.
            )                   # End positive lookbehind.
            (?<!                # Begin negative lookbehind.
              Mr\.              # Skip either "Mr."
            | Mrs\.             # or "Mrs.",
            | Ms\.              # or "Ms.",
            | Jr\.              # or "Jr.",
            | Dr\.              # or "Dr.",
            | Prof\.            # or "Prof.",
            | Vol\.             # or "Vol.",
            | A\.D\.            # or "A.D.",
            | B\.C\.            # or "B.C.",
            | Sr\.              # or "Sr.",
            | T\.V\.A\.         # or "T.V.A.",
                                # or... (you get the idea).
            )                   # End negative lookbehind.
            \s+                 # Split on whitespace between sentences.
            /uix';

        $sentences = preg_split($re, $text, -1, PREG_SPLIT_NO_EMPTY);
        return $sentences;
    }

$sentences = 'Entertainment media properties. Ã Fairy Tail and Tokyo Ghoul. Entertainment media properties. &Acirc;&nbsp; Fairy Tail and Tokyo Ghoul.';

$sentences = splitSentences($sentences);

print_r($sentences);

The examples you gave in the comments did not work because they have no whitespace characters between the two sentences, and your code specifically requires whitespace between sentences:

\s+                 # Split on whitespace between sentences.

The example below, taken from your comments above, does not work simply because there is no space before Â.

http://ideone.com/m164fp

score 2 · Answer 5 · edited Jun 20 '20 at 09:12

I believe it is impossible to build a bulletproof sentence splitter, considering that user-generated content is not always grammatically and syntactically correct. Moreover, 100% correct results are out of reach because scraping and content-extraction tools are imperfect and may fail to return clean content, leaving whitespace or punctuation rubbish in the text. And finally, business is now more biased towards a good-enough strategy: if you manage to split the text correctly 95% of the time, it is in most cases considered a success.

Now, any sentence splitting task is an NLP task, and just one, two, or three regexps are not enough. Rather than devising your own regex chain, I'd advise using an existing NLP library.

vanderlee's php-sentence (depends on reasonably grammatically correct punctuation)

The following is a rough list of the rules used to split sentences.

Each linebreak separates sentences.

The end of the text indicates the end of a sentence if not otherwise ended through proper punctuation.

Sentences must be at least two words long, unless a linebreak or end-of-text.

An empty line is not a sentence.

Each question- or exclamation mark or combination thereof, is considered the end of a sentence.

A single period is considered the end of a sentence, unless it is preceded by one word or followed by one word.

A sequence of multiple periods is not considered the end of a sentence.

Usage example:

<?php
    require_once 'classes/autoloader.php'; // Include the autoloader.
    $text   = "Hello there, Mr. Smith. What're you doing today... Smith,"
            . " my friend?\n\nI hope it's good. This last sentence will"
            . " cost you $2.50! Just kidding :)"; // This is the test text we're going to use
    $Sentence   = new Sentence;   // Create a new instance
    $sentences  = $Sentence->split($text); // Split into array of sentences
    $count      = $Sentence->count($text); // Count the number of sentences
?>

NlpTools is another library you might use for this task. Here is sample code implementing a naive rule-based sentence tokenizer:

<?php
include ('vendor/autoload.php');
 
use \NlpTools\Tokenizers\ClassifierBasedTokenizer;
use \NlpTools\Tokenizers\WhitespaceTokenizer;
use \NlpTools\Classifiers\ClassifierInterface;
use \NlpTools\Documents\DocumentInterface;
 
class EndOfSentence implements ClassifierInterface
{
    public function classify(array $classes, DocumentInterface $d) {
        list($token,$before,$after) = $d->getDocumentData();
 
        $dotcnt = count(explode('.',$token))-1;
        $lastdot = substr($token,-1)=='.';
 
        if (!$lastdot) // assume that all sentences end in full stops
            return 'O';
 
        if ($dotcnt>1) // to catch some naive abbreviations U.S.A.
            return 'O';
 
        return 'EOW';
    }
}
$tok = new ClassifierBasedTokenizer(
    new EndOfSentence(),
    new WhitespaceTokenizer()
);
$text = "We are what we repeatedly do.
        Excellence, then, is not an act, but a habit.";
 
print_r($tok->tokenize($text));
 
// Array
// (
//    [0] => We are what we repeatedly do.
//    [1] => Excellence, then, is not an act, but a habit.
// )

You can get a PHP/Java bridge for using Java StanfordNLP (here is a Java example of splitting text into sentences).

IMPORTANT NOTE: Most NLP tokenization models I tested do not handle glued sentences well. However, if you add a space after each punctuation chain, sentence splitting quality improves. Just add this before sending the text to the sentence splitting function:

$txt = preg_replace('~\p{P}+~u', '$0 ', $txt); // u modifier: treat $txt as UTF-8 so \p{P} matches whole characters, not single bytes

Thank you for the rundown of relevant scripts. I have a question. The preg_replace() regex example at the end, does it add a space after *every* punctuation or what exactly does it do? There are various instances where a space shouldn't be added. For example "3.50" — Henrik Petterson, Feb 03 '16 at 13:03
It will add a space after each one or more punctuation, and it is good for counting sentences. If you want to get the sentences, some more complex post-process would be required. — Wiktor Stribiżew, Feb 03 '16 at 13:05
I'm choosing @ndn answer however I would like to thank you so much for taking the time to post this answer which will be very useful when we perform unit tests etc. — Henrik Petterson, Feb 06 '16 at 20:46
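The "3.50" concern raised in the comments above can be addressed with a narrower variant of the one-liner. This is only a sketch under stated assumptions: the helper name `unglue_sentences` is hypothetical, and the rule "insert a space only when `.`, `!` or `?` is directly followed by an uppercase letter" is a swapped-in heuristic, not what the answer's regex does. It still mangles dotted abbreviations such as "U.S.A.".

```php
<?php
// Hypothetical helper; a narrower variant of the one-liner above, not a replacement for it.
function unglue_sentences(string $txt): string
{
    // Insert a space only where [.!?] is immediately followed by an uppercase letter,
    // so "3.50" and "What're" are left untouched.
    // preg_replace returns null on invalid UTF-8, so fall back to the input.
    return preg_replace('~([.!?]+)(?=\p{Lu})~u', '$1 ', $txt) ?? $txt;
}

echo unglue_sentences('It costs $2.50!Just kidding.What\'re you doing?'), "\n";
```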

score 2 · Answer 6 · answered Jun 16 '22 at 20:15

I know this question is old and has been nicely answered by @ndnenkov, but I figured I could clean up the PHP and make it more efficient, since it was really slow on large bodies of text.

Here are my updates:

function sentence_split($text) {
    // put regex tests into an easier to read array
    $regexes = array(
        [
            "is_sentence_boundary"=>false,
            "before"=>'/(?:(?:[\'\"„][\.!?…][\'\"”]\s)|(?:[^\.]\s[A-Z]\.\s)|(?:\b(?:St|Gen|Hon|Prof|Dr|Mr|Ms|Mrs|[JS]r|Col|Maj|Brig|Sgt|Capt|Cmnd|Sen|Rev|Rep|Revd)\.\s)|(?:\b(?:St|Gen|Hon|Prof|Dr|Mr|Ms|Mrs|[JS]r|Col|Maj|Brig|Sgt|Capt|Cmnd|Sen|Rev|Rep|Revd)\.\s[A-Z]\.\s)|(?:\bApr\.\s)|(?:\bAug\.\s)|(?:\bBros\.\s)|(?:\bCo\.\s)|(?:\bCorp\.\s)|(?:\bDec\.\s)|(?:\bDist\.\s)|(?:\bFeb\.\s)|(?:\bInc\.\s)|(?:\bJan\.\s)|(?:\bJul\.\s)|(?:\bJun\.\s)|(?:\bMar\.\s)|(?:\bNov\.\s)|(?:\bOct\.\s)|(?:\bPh\.?D\.\s)|(?:\bSept?\.\s)|(?:\b\p{Lu}\.\p{Lu}\.\s)|(?:\b\p{Lu}\.\s\p{Lu}\.\s)|(?:\bcf\.\s)|(?:\be\.g\.\s)|(?:\besp\.\s)|(?:\bet\b\s\bal\.\s)|(?:\bvs\.\s)|(?:\p{Ps}[!?]+\p{Pe} ))\Z/su',
            "after"=>'/\A(?:)/su'
        ],
        [
            "is_sentence_boundary"=>false,
            "before"=>'/(?:(?:[\.\s]\p{L}{1,2}\.\s))\Z/su',
            "after"=>'/\A(?:[\p{N}\p{Ll}])/su'
        ],
        [
            "is_sentence_boundary"=>false,
            "before"=>'/(?:(?:[\[\(]*\.\.\.[\]\)]* ))\Z/su',
            "after"=>'/\A(?:[^\p{Lu}])/su'
        ],
        [
            "is_sentence_boundary"=>false,
            "before"=>'/(?:(?:\b(?:pp|[Vv]iz|i\.?\s*e|[Vvol]|[Rr]col|maj|Lt|[Ff]ig|[Ff]igs|[Vv]iz|[Vv]ols|[Aa]pprox|[Ii]ncl|Pres|[Dd]ept|min|max|[Gg]ovt|lb|ft|c\.?\s*f|vs)\.\s))\Z/su',
            "after"=>'/\A(?:[^\p{Lu}]|I)/su'
        ],
        [
            "is_sentence_boundary"=>false,
            "before"=>'/(?:(?:\b[Ee]tc\.\s))\Z/su',
            "after"=>'/\A(?:[^\p{Lu}])/su'
        ],
        [
            "is_sentence_boundary"=>false,
            "before"=>'/(?:(?:[\.!?…]+\p{Pe} )|(?:[\[\(]*…[\]\)]* ))\Z/su',
            "after"=>'/\A(?:\p{Ll})/su'
        ],
        [
            "is_sentence_boundary"=>false,
            "before"=>'/(?:(?:\b\p{L}\.))\Z/su',
            "after"=>'/\A(?:\p{L}\.)/su'
        ],
        [
            "is_sentence_boundary"=>false,
            "before"=>'/(?:(?:\b\p{L}\.\s))\Z/su',
            "after"=>'/\A(?:\p{L}\.\s)/su'
        ],
        [
            "is_sentence_boundary"=>false,
            "before"=>'/(?:(?:\b[Ff]igs?\.\s)|(?:\b[nN]o\.\s))\Z/su',
            "after"=>'/\A(?:\p{N})/su'
        ],
        [
            "is_sentence_boundary"=>false,
            "before"=>'/(?:(?:[\"”\']\s*))\Z/su',
            "after"=>'/\A(?:\s*\p{Ll})/su'
        ],
        [
            "is_sentence_boundary"=>true,
            "before"=>'/(?:(?:[\.!?…][\x{00BB}\x{2019}\x{201D}\x{203A}\"\'\p{Pe}\x{0002}]*\s)|(?:\r?\n))\Z/su',
            "after"=>'/\A(?:)/su'
        ],
        [
            "is_sentence_boundary"=>true,
            "before"=>'/(?:(?:[\.!?…][\'\"\x{00BB}\x{2019}\x{201D}\x{203A}\p{Pe}\x{0002}]*))\Z/su',
            "after"=>'/\A(?:\p{Lu}[^\p{Lu}])/su'
        ],
        [
            "is_sentence_boundary"=>true,
            "before"=>'/(?:(?:\s\p{L}[\.!?…]\s))\Z/su',
            "after"=>'/\A(?:\p{Lu}\p{Ll})/su'
        ]
    );

    $sentences = array();
    $sentence = '';
    $before = '';
    $testLen = 10; // Used to set before/after chunk sizes. 10 seems to be the smallest that works the best.
    $after = substr($text, 0, $testLen); // start with the first set of chars.

    while($text != '') {
        // run regex tests
        foreach($regexes as $reg) {
            if(preg_match($reg["before"], $before) && preg_match($reg["after"], $after)) {
                // if this passes a sentence ending test then add to the array
                if($reg["is_sentence_boundary"]) {
                    $sentences[] = $sentence;
                    $sentence = '';
                }
                break;
            }
        }

        // add the char to the sentence
        $sentence .= $after[0];

        // eat at text until empty to end loop
        $text = substr($text, 1);

        // add a char behind the before var and then remove the first char
        $before = substr($before.$after[0], -$testLen);

        // create a new after with the first chars from the text
        $after = substr($text, 0, $testLen);

    }

    if($sentence != '') {
        $sentences[] = $sentence . $after;
    }
    return $sentences;
}
$text = "Mr. Entertainment media properties.Â Fairy Tail 3.5 and Tokyo Ghoul.";
print_r(sentence_split($text));

score 1 · Answer 7 · answered Feb 03 '16 at 07:57

There is a quite complex Unicode Text Segmentation algorithm that deals with various text boundaries, including sentence boundaries.

http://unicode.org/reports/tr29/

The best-known implementation of this algorithm is ICU.

I have found this class: http://php.net/manual/en/class.intlbreakiterator.php however it seems to be in git, not in mainstream.

So if you want to solve this VERY complex problem in the best way, I'd suggest to:

Get this class from somewhere
Write a small PHP plugin that wraps ICU functionality you need - it is actually quite simple as long as you build specific functionality.
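Where PHP's intl extension is installed, the `IntlBreakIterator` class linked above can be called directly, so no custom wrapper is needed. The following is a minimal sketch: the function name `icu_sentence_split` and the `en_US` default locale are assumptions, and the result depends on the ICU rules bundled with the PHP build, which do not know abbreviations such as "Mr." by default.

```php
<?php
// Hypothetical wrapper around ICU's sentence break iterator (requires ext-intl).
function icu_sentence_split(string $text, string $locale = 'en_US'): array
{
    $iterator = IntlBreakIterator::createSentenceInstance($locale);
    $iterator->setText($text);

    $sentences = [];
    // getPartsIterator() yields the text between consecutive boundaries.
    foreach ($iterator->getPartsIterator() as $part) {
        $part = trim($part); // drop the trailing whitespace ICU keeps with each sentence
        if ($part !== '') {
            $sentences[] = $part;
        }
    }
    return $sentences;
}

// Only run the demo when the intl extension is actually available.
if (class_exists('IntlBreakIterator')) {
    print_r(icu_sentence_split('Entertainment media properties. Fairy Tail 3.5 and Tokyo Ghoul.'));
}
```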
