I have this function, which makes use of preg_replace_callback to split a sentence into "chains" of blocks belonging to different categories (alphabetic, han characters, everything else).
The function is trying to also include the characters ' , { , and } as "alphabetic"
function String_SplitSentence($string)
{
$res = array();
preg_replace_callback("~\b(?<han>\p{Han}+)\b|\b(?<alpha>[a-zA-Z0-9{}']+)\b|(?<other>[^\p{Han}A-Za-z0-9\s]+)~su",
function($m) use (&$res)
{
if (!empty($m["han"]))
{
$t = array("type" => "han", "text" => $m["han"]);
array_push($res,$t);
}
else if (!empty($m["alpha"]))
{
$t = array("type" => "alpha", "text" => $m["alpha"]);
array_push($res, $t);
}
else if (!empty($m["other"]))
{
$t = array("type" => "other", "text" => $m["other"]);
array_push($res, $t);
}
},
$string);
return $res;
}
However, something seems to be wrong with the curly braces.
print_r(String_SplitSentence("Many cats{1}, several rats{2}"));
As can be seen in the output, the function treats { as an alphabetic character, as indicated, but stops at } and treats it as "other" instead.
Array
(
[0] => Array
(
[type] => alpha
[text] => Many
)
[1] => Array
(
[type] => alpha
[text] => cats{1
)
[2] => Array
(
[type] => other
[text] => },
)
[3] => Array
(
[type] => alpha
[text] => several
)
[4] => Array
(
[type] => alpha
[text] => rats{2
)
[5] => Array
(
[type] => other
[text] => }
)
What am I doing wrong?