Since your current solution uses the u regex modifier, I'm assuming your input is encoded as UTF-8.
The following solution is definitely not simpler (apart from the regex) and I'm not even sure it's faster, but it's more low-level and shows the actual escaping procedure.
    $input = preg_replace_callback('#[^\x00-\x7f]#u', function($m) {
        $utf16 = mb_convert_encoding($m[0], 'UTF-16BE', 'UTF-8');
        if (strlen($utf16) <= 2) {
            $esc = '\u' . bin2hex($utf16);
        }
        else {
            $esc = '\u' . bin2hex(substr($utf16, 0, 2)) .
                   '\u' . bin2hex(substr($utf16, 2, 2));
        }
        return $esc;
    }, $input);
One fundamental problem is that PHP doesn't have an ord function that works with UTF-8 (mb_ord was only added in PHP 7.2). You either have to use mb_convert_encoding, or you have to roll your own UTF-8 decoder (see linked question), which would allow for additional optimizations. Two- and three-byte UTF-8 sequences map to a single UTF-16 code unit. Four-byte sequences require two code units (high and low surrogate).
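To make the hand-rolled route concrete, here is a minimal sketch of such a decoder. The function name escape_non_ascii is made up for illustration, and the code assumes the input is already valid UTF-8 (no checks for truncated or overlong sequences), so treat it as a starting point rather than a hardened implementation.

```php
<?php
// Hypothetical helper, not part of the solution above.
// Assumes $input is valid UTF-8; malformed input is not detected.
function escape_non_ascii(string $input): string
{
    $out = '';
    $len = strlen($input);
    for ($i = 0; $i < $len; ) {
        $b = ord($input[$i]);
        if ($b < 0x80) {
            // ASCII byte: copy through unchanged.
            $out .= $input[$i];
            $i += 1;
            continue;
        }
        if ($b < 0xE0) {
            // Two-byte sequence: 110xxxxx 10xxxxxx
            $cp = (($b & 0x1F) << 6)
                | (ord($input[$i + 1]) & 0x3F);
            $i += 2;
        } elseif ($b < 0xF0) {
            // Three-byte sequence: 1110xxxx 10xxxxxx 10xxxxxx
            $cp = (($b & 0x0F) << 12)
                | ((ord($input[$i + 1]) & 0x3F) << 6)
                | (ord($input[$i + 2]) & 0x3F);
            $i += 3;
        } else {
            // Four-byte sequence: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            $cp = (($b & 0x07) << 18)
                | ((ord($input[$i + 1]) & 0x3F) << 12)
                | ((ord($input[$i + 2]) & 0x3F) << 6)
                | (ord($input[$i + 3]) & 0x3F);
            $i += 4;
        }
        if ($cp < 0x10000) {
            // Code points below U+10000 fit in one UTF-16 code unit.
            $out .= sprintf('\u%04x', $cp);
        } else {
            // Supplementary planes: split into high and low surrogate.
            $cp -= 0x10000;
            $out .= sprintf('\u%04x\u%04x',
                0xD800 | ($cp >> 10),
                0xDC00 | ($cp & 0x3FF));
        }
    }
    return $out;
}
```

Called as escape_non_ascii($input), this walks the string once and needs neither the regex nor mb_convert_encoding. Whether it is actually faster would have to be measured.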
If you're aiming for simplicity and readability, you probably can't beat the json_encode approach.
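For comparison, one plausible shape of that approach is sketched below. The original snippet isn't reproduced here, so this is a guess at it, and the wrapper name escape_with_json is purely illustrative. It relies on json_encode escaping non-ASCII characters as \uXXXX by default (that is, without JSON_UNESCAPED_UNICODE).

```php
<?php
// Hypothetical wrapper; an assumption about what the json_encode
// approach looks like, not a copy of the original code.
function escape_with_json(string $input): string
{
    return preg_replace_callback('#[^\x00-\x7f]#u', function ($m) {
        // json_encode yields a quoted JSON string such as "\u00e9";
        // substr() strips the surrounding double quotes.
        return substr(json_encode($m[0]), 1, -1);
    }, $input);
}
```

Because only the non-ASCII matches are passed to json_encode, ASCII characters such as quotes and slashes are left untouched.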