You may extend @ndn's solution to achieve what you need. Note that $before_regexes
contain a list of known abbreviations, add those that are present in your corpora. I added qtd
there.
Then, note that $before_regexes
and $after_regexes
are paired. I added '/(?:[”’"\'»])\s*\Z/u'
/ '/\A(?:\(\p{L})/u'
pair and marked it as a non-sentence boundary (with the first false
in the $is_sentence_boundary
array. The regex pair means: find a quotation mark (”’"'»
), 0+ whitespaces, and then followed with (
(with \(
) and any Unicode letter (\p{L}
), then there should be no split.
function sentence_split($text) {
$before_regexes = array('/(?:[”’"\'»])\s*\Z/u',
'/(?:(?:[\'\"„][\.!?…][\'\"”]\s)|(?:[^\.]\s[A-Z]\.\s)|(?:\b(?:St|Gen|Hon|Prof|Dr|Mr|Ms|Mrs|[JS]r|Col|Maj|Brig|Sgt|Capt|Cmnd|Sen|Rev|Rep|Revd)\.\s)|(?:\b(?:St|Gen|Hon|Prof|Dr|Mr|Ms|Mrs|[JS]r|Col|Maj|Brig|Sgt|Capt|Cmnd|Sen|Rev|Rep|Revd)\.\s[A-Z]\.\s)|(?:\bApr\.\s)|(?:\bAug\.\s)|(?:\bBros\.\s)|(?:\bCo\.\s)|(?:\bCorp\.\s)|(?:\bDec\.\s)|(?:\bDist\.\s)|(?:\bFeb\.\s)|(?:\bInc\.\s)|(?:\bJan\.\s)|(?:\bJul\.\s)|(?:\bJun\.\s)|(?:\bMar\.\s)|(?:\bNov\.\s)|(?:\bOct\.\s)|(?:\bPh\.?D\.\s)|(?:\bSept?\.\s)|(?:\b\p{Lu}\.\p{Lu}\.\s)|(?:\b\p{Lu}\.\s\p{Lu}\.\s)|(?:\bcf\.\s)|(?:\be\.g\.\s)|(?:\besp\.\s)|(?:\bet\b\s\bal\.\s)|(?:\bvs\.\s)|(?:\p{Ps}[!?]+\p{Pe} ))\Z/su',
'/(?:(?:[\.\s]\p{L}{1,2}\.\s))\Z/su',
'/(?:(?:[\[\(]*\.\.\.[\]\)]* ))\Z/su',
'/(?:(?:\b(?:pp|[Vv]iz|i\.?\s*e|[Vvol]|[Rr]col|maj|Lt|[Ff]ig|[Ff]igs|[Vv]iz|[Vv]ols|[Aa]pprox|[Ii]ncl|Pres|[Dd]ept|min|max|[Gg]ovt|lb|ft|c\.?\s*f|vs|qtd)\.\s))\Z/su',
'/(?:(?:\b[Ee]tc\.\s))\Z/su',
'/(?:(?:[\.!?…]+\p{Pe} )|(?:[\[\(]*…[\]\)]* ))\Z/su',
'/(?:(?:\b\p{L}\.))\Z/su',
'/(?:(?:\b\p{L}\.\s))\Z/su',
'/(?:(?:\b[Ff]igs?\.\s)|(?:\b[nN]o\.\s))\Z/su',
'/(?:(?:[\"”\']\s*))\Z/su',
'/(?:(?:[\.!?…][\x{00BB}\x{2019}\x{201D}\x{203A}\"\'\p{Pe}\x{0002}]*\s)|(?:\r?\n))\Z/su',
'/(?:(?:[\.!?…][\'\"\x{00BB}\x{2019}\x{201D}\x{203A}\p{Pe}\x{0002}]*))\Z/su',
'/(?:(?:\s\p{L}[\.!?…]\s))\Z/su');
$after_regexes = array('/\A(?:\(\p{L})/u',
'/\A(?:)/su',
'/\A(?:[\p{N}\p{Ll}])/su',
'/\A(?:[^\p{Lu}])/su',
'/\A(?:[^\p{Lu}]|I)/su',
'/\A(?:[^p{Lu}])/su',
'/\A(?:\p{Ll})/su',
'/\A(?:\p{L}\.)/su',
'/\A(?:\p{L}\.\s)/su',
'/\A(?:\p{N})/su',
'/\A(?:\s*\p{Ll})/su',
'/\A(?:)/su',
'/\A(?:\p{Lu}[^\p{Lu}])/su',
'/\A(?:\p{Lu}\p{Ll})/su');
$is_sentence_boundary = array(false, false, false, false, false, false, false, false, false, false, false, true, true, true);
$count = 13;
$sentences = array();
$sentence = '';
$before = '';
$after = substr($text, 0, 10);
$text = substr($text, 10);
while($text != '') {
for($i = 0; $i < $count; $i++) {
if(preg_match($before_regexes[$i], $before) && preg_match($after_regexes[$i], $after)) {
if($is_sentence_boundary[$i]) {
array_push($sentences, $sentence);
$sentence = '';
}
break;
}
}
$first_from_text = $text[0];
$text = substr($text, 1);
$first_from_after = $after[0];
$after = substr($after, 1);
$before .= $first_from_after;
$sentence .= $first_from_after;
$after .= $first_from_text;
}
if($sentence != '' && $after != '') {
array_push($sentences, $sentence.$after);
}
return $sentences;
}
See the PHP demo.