
How can I split a text into an array of sentences?

Example text:

Fry me a Beaver. Fry me a Beaver! Fry me a Beaver? Fry me Beaver no. 4?! Fry me many Beavers... End

Should output:

0 => Fry me a Beaver.
1 => Fry me a Beaver!
2 => Fry me a Beaver?
3 => Fry me Beaver no. 4?!
4 => Fry me many Beavers...
5 => End

I tried some solutions that I've found on SO through search, but they all fail, especially at the 4th sentence.

/(?<=[!?.])./

/\.|\?|!/

/((?<=[a-z0-9)][.?!])|(?<=[a-z0-9][.?!]\"))(\s|\r\n)(?=\"?[A-Z])/

/(?<=[.!?]|[.!?][\'"])\s+/    // <- closest one
thelolcat
  • The sentence #4 doesn't follow standard syntax. You need a class of `Terminators` - tokens that mark the end of a sentence. If you use one of the terminators as a regular symbol, then it's either not a terminator or you're misforming the sentences. You can't have your cake and eat it too, to put it simply. – Shark May 04 '13 at 18:14
  • I make cakes and eat them all the time :P Can a regex look ahead 2 characters, so that if the 2nd character is not an uppercase A-Z, the punctuation before it is not treated as valid? – thelolcat May 04 '13 at 18:16
  • Sounds like you already know what needs to be done. – Shark May 04 '13 at 18:20
  • But how do I get that into the regex? – thelolcat May 04 '13 at 18:21
  • @thelolcat you are better off with your own parser; a single regex won't do! You have to consider sentences which contain `Mr.thelolcat`, `no.1` – Anirudha May 04 '13 at 18:32
  • How should any computer in this world know that `no. 4?!` is the end of a sentence? What if it's `no. 4 (the number after 3)?!` You are currently entering spheres which are reserved for Chuck Norris – hek2mgl May 04 '13 at 18:40
  • @lolcat what you're asking can be done with regexes; what you need is a zero-width assertion. Also, the last regex you gave seems to work. What do you think is wrong with it? – aaronman May 04 '13 at 18:41

2 Answers


Since you want to "split" the text into sentences, why are you trying to match them?

For this case let's use preg_split().

Code:

$str = 'Fry me a Beaver. Fry me a Beaver! Fry me a Beaver? Fry me Beaver no. 4?! Fry me many Beavers... End';
$sentences = preg_split('/(?<=[.?!])\s+(?=[a-z])/i', $str);
print_r($sentences);

Output:

Array
(
    [0] => Fry me a Beaver.
    [1] => Fry me a Beaver!
    [2] => Fry me a Beaver?
    [3] => Fry me Beaver no. 4?!
    [4] => Fry me many Beavers...
    [5] => End
)

Explanation:

Put simply, we split on runs of whitespace (\s+), subject to two conditions:

  1. (?<=[.?!]) is a positive lookbehind assertion: the whitespace must be preceded by a period, question mark, or exclamation mark.

  2. (?=[a-z]) is a positive lookahead assertion: the whitespace must be followed by a letter (the i modifier makes this case-insensitive). This is a workaround for the "no. 4" problem, because the space after "no." is followed by a digit and is therefore not a split point.
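
As a rough sketch of what each assertion contributes, the snippet below splits a shortened version of the sample string twice: once with the lookbehind only, and once with both assertions. The shortened input and the variable names are my own choices for illustration.

```php
<?php
// Shortened sample string (an assumption for this sketch, not the full input).
$str = 'Fry me Beaver no. 4?! Fry me many Beavers... End';

// Lookbehind only: every whitespace run that follows . ? or ! is a split point,
// so "no. 4" is broken apart.
$withoutLookahead = preg_split('/(?<=[.?!])\s+/', $str);

// Lookbehind plus lookahead: the whitespace must also be followed by a letter,
// so the space before the digit 4 is skipped.
$withLookahead = preg_split('/(?<=[.?!])\s+(?=[a-z])/i', $str);

print_r($withoutLookahead); // 'Fry me Beaver no.', '4?!', 'Fry me many Beavers...', 'End'
print_r($withLookahead);    // 'Fry me Beaver no. 4?!', 'Fry me many Beavers...', 'End'
```

The same lookahead is also why a sentence that starts with a digit would not be split off: the check is purely on the character after the whitespace.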

HamZa
  • Just a question: shouldn't `\s` be `\s+`? I mean to ignore multiple spaces grouped together – thelolcat May 04 '13 at 19:09
  • @thelolcat Well, you're right in case there are multiple spaces! – HamZa May 04 '13 at 19:09
  • @HamZa: what would this translate to in Java? I tried the same thing in Java but it doesn't work. Can you guide me? – voidMainReturn Jul 15 '13 at 12:47
  • @tejas I would guess you need to use double backslash instead of one `(?<=[.?!])\\s+(?=[a-z])` – HamZa Jul 15 '13 at 12:51
  • yes this is what I am using : str.split("(?<=[.?!])\\s+(?=[a-z])"); But of no use. – voidMainReturn Jul 15 '13 at 12:54
  • @tejas What do you mean by "not correctly"? Would you mind joining me in the [regex chatroom](http://chat.stackoverflow.com/rooms/25767)? – HamZa Jul 15 '13 at 12:58
  • actually, the following one worked : str.split("(?<=[.?!])\\s+(?=[a-zA-Z])") – voidMainReturn Jul 15 '13 at 13:00
  • @tejas You see that little `i` after the closing `/`? That means the match is case-insensitive. I think you could use my expression and add `(?i)` to the beginning of it :) – HamZa Jul 15 '13 at 13:02
  • Yeah, OK. I didn't know it's written as `(?i)` in Java. I tried using `/i` and it didn't work – voidMainReturn Jul 15 '13 at 13:06
  • @tejas Not only in Java; it's possible in many more languages. – HamZa Jul 15 '13 at 13:08
  • Thank you! Add it to my helper library - https://github.com/Cosmologist/Gears/blob/master/src/Gears/StringType/Text.php – Cosmologist Oct 06 '16 at 13:50
  • I love this and would love it even more if I could disqualify `...` from counting as the end of a sentence and *include* `.)` as the end of a sentence. Ideas? Thanks. – Ryan Feb 11 '17 at 00:35
  • @Ryan quick [`(?<!\.\.\.)(?<=[.?!]|\.\))\s+(?=[a-z])`](https://regex101.com/r/6bf73H/1). See if it suits your needs. – HamZa Feb 11 '17 at 00:39
  • Based on what I learned from yours, I was able to edit it to handle even more corner cases that I'm running into: https://regex101.com/r/e4NYyd/4 Cool stuff. – Ryan Feb 11 '17 at 01:39
  • This doesn't work. Try adding "i.e. " to the sentence, this regex fails at this – Richard May 17 '18 at 19:42
  • Also it doesn't work for sentences with Mr. and !" at the end of sentence. – holden321 Dec 21 '18 at 16:20
  • "2020 is the year the system failed." Sentences may start with a digit... which makes avoiding "See (A. 1) for reference." more complex. – Markus AO Sep 16 '20 at 21:07

I recommend matching the delimiting punctuation without a lookbehind, then releasing those matched characters (with \K), then matching the whitespace, then looking ahead for an uppercase letter that represents the start of the next sentence.

Code:

$str = 'Fry me a Beaver. Fry me a Beaver! Fry me a Beaver? Fry me Beaver no. 4?! Fry me many Beavers... End';

var_export(
    preg_split('~[.?!]+\K\s+(?=[A-Z])~', $str, 0, PREG_SPLIT_NO_EMPTY)
);

Output:

array (
  0 => 'Fry me a Beaver.',
  1 => 'Fry me a Beaver!',
  2 => 'Fry me a Beaver?',
  3 => 'Fry me Beaver no. 4?!',
  4 => 'Fry me many Beavers...',
  5 => 'End',
)

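Though not necessary for the sample string, PREG_SPLIT_NO_EMPTY removes empty elements from the result. Because the lookahead requires an uppercase letter after the whitespace, this pattern never matches at the very end of the string, so trailing punctuation alone does not create an empty element; the flag is a safeguard, for instance against an empty input string.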
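
To see what the flag does in isolation, here is a minimal sketch. The first pattern is hypothetical and is not the one from this answer: it consumes the punctuation itself, which is the usual way a trailing empty element appears. The second case uses this answer's pattern on an empty input string. The short input 'One. Two.' is also just an assumption for illustration.

```php
<?php
// Hypothetical pattern that consumes the punctuation (NOT the answer's pattern).
$consuming = '/[.?!]+\s*/';

// The final "." matches at the end of the string, leaving an empty last piece.
$withEmpty    = preg_split($consuming, 'One. Two.');
$withoutEmpty = preg_split($consuming, 'One. Two.', 0, PREG_SPLIT_NO_EMPTY);

// The answer's pattern applied to an empty input string.
$pattern     = '~[.?!]+\K\s+(?=[A-Z])~';
$emptyKept   = preg_split($pattern, '');
$emptyPruned = preg_split($pattern, '', 0, PREG_SPLIT_NO_EMPTY);

var_export($withEmpty);    // 'One', 'Two', ''
var_export($withoutEmpty); // 'One', 'Two'
var_export($emptyKept);    // a single empty string
var_export($emptyPruned);  // an empty array
```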

Using \K in my answer requires less backtracking, which lets the regex engine "step" through the string more efficiently. In HamZa's answer, the regex engine starts matching every time there is a space; after the space is matched, it needs to look backward to check for the punctuation, and if that qualifies, it then needs to look ahead for a letter.

In my approach, the regex engine only begins considering matches when it encounters one of the listed punctuation symbols, and it never looks back. There are many spaces to match, but far fewer qualifying symbols. For these reasons, on the sample input string, my pattern splits the string in 40 steps and HamZa's pattern splits the string in 74 steps.

This efficiency is not really worth bragging about for relatively small strings, but if you are parsing large texts, then efficiency and minimal backtracking become more important.
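
If you would rather measure on your own data than rely on step counts, a rough timing sketch might look like the following. The repeated sample text, the run count of 100, and the labels are arbitrary assumptions; hrtime() needs PHP 7.3 or newer, and the numbers will vary with the PCRE version and its internal optimizations, so treat the output as indicative only.

```php
<?php
// Hypothetical benchmark input: the sample string repeated many times.
$sample = 'Fry me a Beaver. Fry me a Beaver! Fry me a Beaver? Fry me Beaver no. 4?! Fry me many Beavers... End';
$text = str_repeat($sample . ' ', 1000);

$patterns = [
    'lookbehind'  => '/(?<=[.?!])\s+(?=[a-z])/i',   // pattern from HamZa's answer
    'backslash-K' => '~[.?!]+\K\s+(?=[A-Z])~',      // pattern from this answer
];

$results = [];
foreach ($patterns as $label => $pattern) {
    $start = hrtime(true);                // monotonic clock, in nanoseconds
    for ($i = 0; $i < 100; $i++) {
        $results[$label] = preg_split($pattern, $text);
    }
    $elapsedMs = (hrtime(true) - $start) / 1e6;
    printf("%-12s %8.2f ms for 100 runs\n", $label, $elapsedMs);
}

// Sanity check: on this input both patterns should produce the same pieces.
var_dump($results['lookbehind'] === $results['backslash-K']);
```

Note that the two patterns are not interchangeable in general: the first accepts any letter after the whitespace, the second only an uppercase one, so they can disagree on other inputs.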

mickmackusa