I need a function that properly converts non-ASCII symbols to their \uXXXX representation. I know json_encode does that, but it adds double quotes to the string, and I assume there is a more refined solution that consumes less CPU than calling json_encode on each symbol.

Here's the current solution:

    $input=preg_replace_callback('#([^\r\n\t\x20-\x7f])#u', function($m) {
        return trim(json_encode($m[1]),'"');
    }, $input);

Does anyone have an idea for a simpler and faster solution?

nwellnhof
NikitOn

1 Answer

Since your current solution uses the u regex modifier, I'm assuming your input is encoded as UTF-8.

The following solution is definitely not simpler (apart from the regex) and I'm not even sure it's faster, but it's more low-level and shows the actual escaping procedure.

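    $input = preg_replace_callback('#[^\x00-\x7f]#u', function($m) {
        // Convert the matched character to big-endian UTF-16.
        $utf16 = mb_convert_encoding($m[0], 'UTF-16BE', 'UTF-8');
        if (strlen($utf16) <= 2) {
            // A single code unit (BMP character).
            $esc = '\u' . bin2hex($utf16);
        }
        else {
            // Two code units: a surrogate pair, escaped separately.
            $esc = '\u' . bin2hex(substr($utf16, 0, 2)) .
                   '\u' . bin2hex(substr($utf16, 2, 2));
        }
        return $esc;
    }, $input);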

One fundamental problem is that PHP's ord function doesn't work with multi-byte UTF-8 characters (mb_ord was only added in PHP 7.2). You either have to use mb_convert_encoding or roll your own UTF-8 decoder (see linked question), which would allow for additional optimizations.

Two- and three-byte UTF-8 sequences map to a single UTF-16 code unit. Four-byte sequences require two code units (a high and a low surrogate).
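To make the "roll your own decoder" option concrete, here is a minimal sketch of what it could look like. The function name escape_non_ascii is made up for illustration, and the code assumes the input is valid UTF-8; it has not been benchmarked against the other two approaches.

```php
<?php
// Hypothetical helper (name is an assumption, not from any library):
// escapes every non-ASCII character of a UTF-8 string as \uXXXX.
// Assumes valid UTF-8; with the u modifier, preg_replace_callback
// returns null if the input is malformed.
function escape_non_ascii($input)
{
    return preg_replace_callback('#[^\x00-\x7f]#u', function ($m) {
        $bytes = $m[0];
        $len   = strlen($bytes);
        $first = ord($bytes[0]);

        // Drop the length marker from the lead byte:
        // 110xxxxx (2 bytes), 1110xxxx (3 bytes), 11110xxx (4 bytes).
        if ($len === 2) {
            $cp = $first & 0x1f;
        } elseif ($len === 3) {
            $cp = $first & 0x0f;
        } else {
            $cp = $first & 0x07;
        }

        // Each continuation byte (10xxxxxx) contributes six more bits.
        for ($i = 1; $i < $len; $i++) {
            $cp = ($cp << 6) | (ord($bytes[$i]) & 0x3f);
        }

        if ($cp < 0x10000) {
            // Two- and three-byte sequences: a single UTF-16 code unit.
            return sprintf('\u%04x', $cp);
        }

        // Four-byte sequences: split into high and low surrogate.
        $cp -= 0x10000;
        return sprintf('\u%04x\u%04x', 0xd800 | ($cp >> 10), 0xdc00 | ($cp & 0x3ff));
    }, $input);
}
```

Calling escape_non_ascii("caf\xc3\xa9") should give caf\u00e9, the same escape json_encode produces for that character. Unlike the regex in the question, this sketch leaves ASCII control characters untouched.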

If you're aiming for simplicity and readability, you probably can't beat the json_encode approach.

nwellnhof
  • Thanks for the explanation. Let's keep your answer as a good option. Maybe someone will test both options for the speed :). – NikitOn Oct 26 '16 at 22:21