16

I would like to divide a text into sentences in PHP. I'm currently using a regex, which gives ~95% accuracy, and I would like to improve on that with a better approach. I've seen NLP tools that do this in Perl, Java, and C, but I haven't seen anything that fits PHP. Do you know of such a tool?

hippietrail
Noam
  • What regex are you using? NLP in PHP sounds like it's going to cause you a heap of pain. – fredley Feb 17 '11 at 17:18
  • "pain" because it's slower than say C? This is the regex I'm using: `preg_split("/(?<!\..)([\?\!\.]+)\s(?!.\.)/",$text,-1, PREG_SPLIT_DELIM_CAPTURE);` What approach would you recommend? – Noam Feb 19 '11 at 07:45
  • Would the https://github.com/bigwhoop/sentence-breaker library be of any use to you? – SenG Jun 16 '15 at 04:54

6 Answers

24

An enhanced regex solution

Assuming you care about handling abbreviations such as Mr. and Mrs., the following single-regex solution works pretty well:

<?php // test.php Rev:20160820_1800
$split_sentences = '%(?#!php/i split_sentences Rev:20160820_1800)
    # Split sentences on whitespace between them.
    # See: http://stackoverflow.com/a/5844564/433790
    (?<=          # Sentence split location preceded by
      [.!?]       # either an end of sentence punct,
    | [.!?][\'"]  # or end of sentence punct and quote.
    )             # End positive lookbehind.
    (?<!          # But don\'t split after these:
      Mr\.        # Either "Mr."
    | Mrs\.       # Or "Mrs."
    | Ms\.        # Or "Ms."
    | Jr\.        # Or "Jr."
    | Dr\.        # Or "Dr."
    | Prof\.      # Or "Prof."
    | Sr\.        # Or "Sr."
    | T\.V\.A\.   # Or "T.V.A."
                 # Or... (you get the idea).
    )             # End negative lookbehind.
    \s+           # Split on whitespace between sentences,
    (?=\S)        # (but not at end of string).
    %xi';  // End $split_sentences.

$text = 'This is sentence one. Sentence two! Sentence thr'.
        'ee? Sentence "four". Sentence "five"! Sentence "'.
        'six"? Sentence "seven." Sentence \'eight!\' Dr. '.
        'Jones said: "Mrs. Smith you have a lovely daught'.
        'er!" The T.V.A. is a big project! '; // Note ws at end.

$sentences = preg_split($split_sentences, $text, -1, PREG_SPLIT_NO_EMPTY);
for ($i = 0; $i < count($sentences); ++$i) {
    printf("Sentence[%d] = [%s]\n", $i + 1, $sentences[$i]);
}
?>

Note that you can easily add or take away abbreviations from the expression. Given the following test paragraph:

This is sentence one. Sentence two! Sentence three? Sentence "four". Sentence "five"! Sentence "six"? Sentence "seven." Sentence 'eight!' Dr. Jones said: "Mrs. Smith you have a lovely daughter!" The T.V.A. is a big project!

Here is the output from the script:

Sentence[1] = [This is sentence one.]
Sentence[2] = [Sentence two!]
Sentence[3] = [Sentence three?]
Sentence[4] = [Sentence "four".]
Sentence[5] = [Sentence "five"!]
Sentence[6] = [Sentence "six"?]
Sentence[7] = [Sentence "seven."]
Sentence[8] = [Sentence 'eight!']
Sentence[9] = [Dr. Jones said: "Mrs. Smith you have a lovely daughter!"]
Sentence[10] = [The T.V.A. is a big project!]

The essential regex solution

The author of the question commented that the above solution "overlooks many options" and is not generic enough. I'm not sure what that means, but the essence of the above expression is about as clean and simple as you can get. Here it is:

$re = '/(?<=[.!?]|[.!?][\'"])\s+(?=\S)/';
$sentences = preg_split($re, $text, -1, PREG_SPLIT_NO_EMPTY);

Note that both solutions correctly identify sentences ending with a quotation mark after the ending punctuation. If you don't care about matching sentences ending in a quotation mark the regex can be simplified to just: /(?<=[.!?])\s+(?=\S)/.
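As a minimal sketch of what the essential regex does and does not handle, the snippet below wraps it in a helper. The function name `split_sentences_basic` and the sample text are illustrative assumptions, not part of the answer. It also trims the input first, since `preg_split` leaves any trailing whitespace attached to the last sentence.

```php
<?php
// Hypothetical helper wrapping the essential regex from this answer.
function split_sentences_basic(string $text): array
{
    // Split on whitespace that follows . ! or ? (optionally plus a closing quote).
    $re = '/(?<=[.!?]|[.!?][\'"])\s+(?=\S)/';
    return preg_split($re, trim($text), -1, PREG_SPLIT_NO_EMPTY);
}

// Without an abbreviation list, "Dr." is treated as the end of a sentence.
$sentences = split_sentences_basic('Dr. Jones arrived. He sat down. ');
foreach ($sentences as $i => $sentence) {
    printf("Sentence[%d] = [%s]\n", $i + 1, $sentence);
}
```

Here the text is split into three pieces ("Dr.", "Jones arrived.", "He sat down."), which is the case the negative lookbehind in the enhanced version is meant to prevent.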

Edit: 20130820_1000 Added T.V.A. (another punctuated word to be ignored) to regex and test string. (to answer PapyRef's comment question)

Edit: 20130820_1800 Tidied and renamed regex and added shebang. Also fixed regexes to prevent splitting text on trailing whitespace.

ridgerunner
  • This still is a very direct approach. I'm looking for something generic, which was built through a learning process. Your solution overlooks many options. – Noam May 04 '11 at 12:50
  • @giorgio79: Yes, if the "ellipsis" consists of three dots in a row. If you are talking about a single Unicode char representing an ellipsis, then this Unicode char would need to be added to the character class for this regex to work. – ridgerunner Aug 08 '11 at 14:23
  • @Noam - If you specifically want a solution that is based on machine learning, please update your question. – David Meister Dec 05 '12 at 11:28
  • With this enhanced regex solution, how can I handle the word "T.V.A."? I tried `| [t|T]\.[v|V]\.[a|A]\. # or "T.V.A",` but it doesn't work – LeMoussel Aug 19 '13 at 07:33
  • @PapyRef - Yes, easily. Take a look at the regex. See the list of exceptions? i.e. `Mr\.|Mrs\.|Ms\.|etc...`? Just add your `T\.V\.A\.` term to this list separating it from the others with the or `|` operator. (Don't forget you need to escape the dots.) – ridgerunner Aug 19 '13 at 15:04
  • @ridgerunner I added the `T\.V\.A\.` term to this list, separating it from the others with the `|` operator, but it still doesn't exclude this word. – LeMoussel Aug 20 '13 at 16:29
  • @PapyRef - In the solution above, I've added `T.V.A.` to the list of abbreviations to be ignored. Hope this helps. – ridgerunner Aug 20 '13 at 18:04
2

Slight improvement on someone else's work:

$re = '/# Split sentences on whitespace between them.
(?<=                # Begin positive lookbehind.
  [.!?]             # Either an end of sentence punct,
| [.!?][\'"]        # or end of sentence punct and quote.
)                   # End positive lookbehind.
(?<!                # Begin negative lookbehind.
  Mr\.              # Skip either "Mr."
| Mrs\.             # or "Mrs.",
| Ms\.              # or "Ms.",
| Jr\.              # or "Jr.",
| Dr\.              # or "Dr.",
| Prof\.            # or "Prof.",
| Sr\.              # or "Sr.",
| \s[A-Z]\.         # or initials ex: "George W. Bush",
                    # or... (you get the idea).
)                   # End negative lookbehind.
\s+                 # Split on whitespace between sentences.
/ix';
$sentences = preg_split($re, $story, -1, PREG_SPLIT_NO_EMPTY);
aksu
clutterjoe
0

@ridgerunner I ported your PHP code to C#.

I get two sentences as the result:

  • Mr. J. Dujardin régle sa T.V.
  • A. en esp. uniquement

The correct result should be a single sentence: Mr. J. Dujardin régle sa T.V.A. en esp. uniquement

And with our test paragraph:

string sText = "This is sentence one. Sentence two! Sentence three? Sentence \"four\". Sentence \"five\"! Sentence \"six\"? Sentence \"seven.\" Sentence 'eight!' Dr. Jones said: \"Mrs. Smith you have a lovely daughter!\" The T.V.A. is a big project!";

The result is

index: 0 sentence: This is sentence one.
index: 22 sentence: Sentence two!
index: 36 sentence: Sentence three?
index: 52 sentence: Sentence "four".
index: 69 sentence: Sentence "five"!
index: 86 sentence: Sentence "six"?
index: 102 sentence: Sentence "seven.
index: 118 sentence: " Sentence 'eight!'
index: 136 sentence: ' Dr. Jones said: "Mrs. Smith you have a lovely daughter!
index: 193 sentence: " The T.V.
index: 203 sentence: A. is a big project!

C# code:

                string sText = "Mr. J. Dujardin régle sa T.V.A. en esp. uniquement";
                Regex rx = new Regex(@"(\S.+?
                                       [.!?]               # Either an end of sentence punct,
                                       | [.!?]['""]         # or end of sentence punct and quote.
                                       )
                                       (?<!                 # Begin negative lookbehind.
                                          Mr.                   # Skip either Mr.
                                        | Mrs.                  # or Mrs.,
                                        | Ms.                   # or Ms.,
                                        | Jr.                   # or Jr.,
                                        | Dr.                   # or Dr.,
                                        | Prof.                 # or Prof.,
                                        | Sr.                   # or Sr.,
                                        | \s[A-Z].              # or initials ex: George W. Bush,
                                        | T\.V\.A\.             # or "T.V.A."
                                       )                    # End negative lookbehind.
                                       (?=|\s+|$)", 
                                       RegexOptions.CultureInvariant | RegexOptions.IgnorePatternWhitespace | RegexOptions.Compiled);
                foreach (Match match in rx.Matches(sText))
                {
                    Console.WriteLine("index: {0}  sentence: {1}", match.Index, match.Value);
                }
LeMoussel
0

As a low-tech approach, you might want to consider using a series of explode calls in a loop, using ., !, and ? as your needles. This would be very memory- and processor-intensive (as most text processing is). You would have a bunch of temporary arrays and one master array with all found sentences numerically indexed in the right order.

Also, you'd have to check for common exceptions (such as a . in titles like Mr. and Dr.), but with everything being in an array, these types of checks shouldn't be that bad.

I'm not sure if this is any better than regex in terms of speed and scaling, but it would be worth a shot. How big are these blocks of text you want to break into sentences?
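The explode-based idea described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the function name `explode_sentences` and the short default abbreviation list are hypothetical, and the sketch does not handle closing quotes after the punctuation or runs such as "?!".

```php
<?php
// Hypothetical helper: split on ., ! and ? with explode(), then re-join
// pieces that end in a known abbreviation such as "Dr.".
function explode_sentences(string $text, array $skip = ['Mr.', 'Mrs.', 'Dr.']): array
{
    $pieces = [$text];
    foreach (['.', '!', '?'] as $needle) {
        $next = []; // temporary array for this pass
        foreach ($pieces as $piece) {
            $parts = explode($needle, $piece);
            $last = count($parts) - 1;
            foreach ($parts as $i => $part) {
                // explode() discards the needle, so restore it on all but the final part.
                $next[] = $i < $last ? $part . $needle : $part;
            }
        }
        $pieces = $next;
    }

    $sentences = []; // master array, in original order
    $buffer = '';
    foreach ($pieces as $piece) {
        $buffer .= $piece;
        $candidate = trim($buffer);
        if ($candidate === '') {
            $buffer = '';
            continue;
        }
        // Inspect the last space-delimited word to spot abbreviations.
        $pos = strrpos($candidate, ' ');
        $lastWord = $pos === false ? $candidate : substr($candidate, $pos + 1);
        if (in_array($lastWord, $skip, true)) {
            continue; // keep accumulating: an abbreviation is not a sentence end
        }
        $sentences[] = $candidate;
        $buffer = '';
    }
    if (trim($buffer) !== '') {
        $sentences[] = trim($buffer); // leftover text without end punctuation
    }
    return $sentences;
}

print_r(explode_sentences('Dr. Jones arrived. Is he late? No!'));
```

Each pass builds a fresh temporary array, which is where the memory cost mentioned above comes from.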

Trav
  • This doesn't answer my question because I'm looking for a library that does it for me. But, can you explain the difference between using explode and preg_split? – Noam Apr 29 '11 at 08:31
  • @Noam: `explode()` splits on a simple string match, without doing any regex. The implication of the answer being that for your use case it should be simple enough to do it without regex; ie just explode on each common punctuation mark. However I agree, it doesn't really answer your question, or even address what you're trying to ask. You're aiming for accuracy, which isn't what he's focusing on at all. (but if you were to take this approach, I'd consider `strtok()` to be a better solution than `explode()` due to the multiple punctuation characters involved) – Spudley Apr 30 '11 at 20:12
0

I was using this regex:

preg_split('/(?<=[.?!])\s(?=[A-Z"\'])/', $text);

It won't work on a sentence that starts with a number, but it should produce very few false positives. Of course, what you are doing matters as well. My program now uses

explode('.',$text);

because I decided speed was more important than accuracy.
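To make the trade-off concrete, here is a small sketch comparing the two calls on the same input. The sample text and the variable names `$byRegex` and `$byExplode` are illustrative assumptions.

```php
<?php
$text = 'It cost 5 dollars. 3 people came. Then they left.';

// Regex version: requires a capital letter or quote after the whitespace,
// so the sentence starting with "3" is not split off.
$byRegex = preg_split('/(?<=[.?!])\s(?=[A-Z"\'])/', $text);

// explode version: splits on every dot, drops the dots themselves,
// and leaves a trailing empty string after the final dot.
$byExplode = explode('.', $text);

print_r($byRegex);
print_r($byExplode);
```

The regex result has two elements, with the first two sentences fused, while `explode` returns four elements that still need trimming and filtering.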

jisaacstone
0

Build a list of abbreviations like this

$skip_array = array('Jr', 'Mr', 'Mrs', 'Ms', 'Dr', 'Prof', 'Sr'); // etc.

Compile them into an expression

$skip = '';
foreach ($skip_array as $abbr) {
    $skip = $skip . (empty($skip) ? '' : '|') . '\s{1}' . $abbr . '[.!?]';
}

Finally, run this preg_split to break the text into sentences.

$lines = preg_split ("/(?<!$skip)(?<=[.?!])\s+(?=[^a-z])/",
                     $txt, -1, PREG_SPLIT_NO_EMPTY);
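The compile step above can also be sketched with `preg_quote()` and `implode()`. This is a variation under stated assumptions, not the snippet above verbatim: the function name `build_sentence_splitter` is hypothetical, `\b` stands in for `\s{1}` so that an abbreviation at the very start of the text is also skipped, and only a dot is accepted after the abbreviation.

```php
<?php
// Hypothetical helper: build the split regex from a list of abbreviations.
function build_sentence_splitter(array $abbreviations): string
{
    $alternatives = [];
    foreach ($abbreviations as $abbr) {
        // preg_quote() escapes regex metacharacters and the "/" delimiter.
        // Each alternative has a fixed length, as PCRE lookbehind requires.
        $alternatives[] = '\b' . preg_quote($abbr, '/') . '\.';
    }
    // With an empty list, leave the negative lookbehind out entirely.
    $skip = $alternatives ? '(?<!' . implode('|', $alternatives) . ')' : '';
    return '/' . $skip . '(?<=[.?!])\s+(?=[^a-z])/';
}

$re = build_sentence_splitter(['Jr', 'Mr', 'Mrs', 'Ms', 'Dr', 'Prof', 'Sr']);
$txt = 'Dr. Jones met Mrs. Smith. They talked! Then they left.';
$lines = preg_split($re, $txt, -1, PREG_SPLIT_NO_EMPTY);
print_r($lines);
```

Keeping the abbreviations in a plain array means the list can be extended without editing the pattern by hand.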

And if you're processing HTML, watch out for deleted tags eliminating the space between sentences. If you have situations.Like this where.They stick together, it becomes immensely more difficult to parse.
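One way to avoid sentences fusing when tags are removed is to replace each tag with a space instead of deleting it. The following is a minimal sketch, assuming simple markup; the function name `html_to_text` is hypothetical, and the approach would also insert a space inside a word that is interrupted by an inline tag.

```php
<?php
// Hypothetical helper: turn HTML into plain text without gluing sentences together.
function html_to_text(string $html): string
{
    // Replace every tag with a space so "</p><p>" does not fuse two sentences.
    $text = preg_replace('/<[^>]*>/', ' ', $html);
    // Decode entities such as &amp; and &quot; back into characters.
    $text = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');
    // Collapse runs of whitespace and trim the ends.
    return trim(preg_replace('/\s+/', ' ', $text));
}

$html = '<p>First sentence.</p><p>Second sentence.</p>';
echo strip_tags($html), "\n";    // the two sentences end up stuck together
echo html_to_text($html), "\n";  // a space is kept between them
```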

user723220
  • `explode` just blows a string into pieces based on a `delimiter`. If you say `explode(" ", "Where are my suspenders?")`, the delimiter is `" "`, a single space. PHP will `explode` your string into pieces wherever it encounters the space, in this case resulting in four words stored in an `array` under `keys` [0-3]. The `delimiter` can be anything: `&`, `#`, `-`, `:`, etc. `preg_split` is a more complicated exploder that incorporates a number of metacharacters, switches, functions and expressions, as in the example above. – user723220 Apr 30 '11 at 20:00