I recommend matching your delimiting punctuation directly (without a lookbehind), then releasing those matched characters with \K, then matching the whitespace, then looking ahead for an uppercase letter that marks the start of the next sentence.
Code: (Demo)
$str = 'Fry me a Beaver. Fry me a Beaver! Fry me a Beaver? Fry me Beaver no. 4?! Fry me many Beavers... End';
var_export(
preg_split('~[.?!]+\K\s+(?=[A-Z])~', $str, 0, PREG_SPLIT_NO_EMPTY)
);
Output:
array (
0 => 'Fry me a Beaver.',
1 => 'Fry me a Beaver!',
2 => 'Fry me a Beaver?',
3 => 'Fry me Beaver no. 4?!',
4 => 'Fry me many Beavers...',
5 => 'End',
)
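To make the role of \K more concrete, here is a small sketch that inspects what the pattern reports as the match. The shortened `$sample` string and the variable names are my own, chosen only for illustration.

```php
<?php
// Illustrative input: a shortened version of the sample string.
$sample = 'Fry me a Beaver. Fry me a Beaver!';

// Same pattern as above; PREG_OFFSET_CAPTURE also reports where the match starts.
preg_match('~[.?!]+\K\s+(?=[A-Z])~', $sample, $match, PREG_OFFSET_CAPTURE);

// $match[0][0] is the reported match text, $match[0][1] is its byte offset.
// The period was consumed before \K, so only the space is reported.
var_export($match[0]);
```

Because the reported match is only the whitespace, preg_split() removes only the whitespace and the punctuation stays attached to the preceding sentence.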
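Though not necessary for the sample string, PREG_SPLIT_NO_EMPTY guards against empty elements in the output array. With the lookahead in place the pattern cannot match at the very end of the string, so the flag mainly matters if you loosen the pattern (for example, by dropping the lookahead) and the string ends with punctuation followed by whitespace.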
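To see the flag in action, here is a minimal sketch. It assumes a looser, hypothetical variant of the pattern (lookahead removed) and an input with a trailing newline, since that is a combination where an empty trailing element actually appears.

```php
<?php
// Hypothetical looser pattern: no lookahead, so whitespace after the final
// punctuation also counts as a split point.
$loosePattern = '~[.?!]+\K\s+~';

// Illustrative input; note the trailing newline.
$text = "Fry me a Beaver. Fry me a Beaver!\n";

$withoutFlag = preg_split($loosePattern, $text);
$withFlag    = preg_split($loosePattern, $text, 0, PREG_SPLIT_NO_EMPTY);

var_export($withoutFlag); // ends with an empty '' element
echo "\n";
var_export($withFlag);    // the empty element is dropped
```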
Using \K in my answer requires less backtracking, which lets the regex engine "step" through the string more efficiently. In Hamza's answer, the regex engine starts a match attempt at every space; after the space is matched, it must look backward to check for punctuation, and if that check passes, it must then look ahead for a letter.
In my approach, the regex engine only begins considering a match when it encounters one of the listed punctuation symbols, and it never looks back. The string contains many spaces, but far fewer qualifying symbols. For these reasons, on the sample input string, my pattern splits the string in 40 steps, whereas Hamza's pattern takes 74 steps.
This efficiency is not really worth bragging about for relatively small strings, but if you are parsing large texts, efficiency and minimal backtracking become more important.
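For comparison, the two approaches can be run side by side. The lookbehind pattern below is my assumption of the general shape described above (space, look back for punctuation, look ahead for a letter); it may differ from the exact pattern in Hamza's answer. The sketch only checks that both produce the same result on the sample string; it does not measure step counts.

```php
<?php
$str = 'Fry me a Beaver. Fry me a Beaver! Fry me a Beaver? Fry me Beaver no. 4?! Fry me many Beavers... End';

// Pattern from this answer: match the punctuation first, then discard it with \K.
$withK = preg_split('~[.?!]+\K\s+(?=[A-Z])~', $str, 0, PREG_SPLIT_NO_EMPTY);

// Assumed lookbehind-style pattern (illustrative only): match the whitespace,
// then check backward for punctuation and forward for an uppercase letter.
$withLookbehind = preg_split('~(?<=[.?!])\s+(?=[A-Z])~', $str, 0, PREG_SPLIT_NO_EMPTY);

// Both patterns yield identical sentence arrays for this input.
var_dump($withK === $withLookbehind); // bool(true)
```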