I have strings like this:

\u041d\u0418\u041a\u041e\u041b\u0410\u0415\u0412

How can I convert this to UTF-8? And what encoding is the given string in? Thank you for participating!

hakre
Denis Óbukhov

3 Answers

The simple approach is to wrap your string in double quotes and let json_decode convert the \uXXXX escapes (which happen to be JavaScript string syntax).

 $str = json_decode("\"$str\"");

These appear to be Russian letters: НИКОЛАЕВ. (The string is already UTF-8 when json_decode returns it.)
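
As a minimal sketch of the approach above, applied to the string from the question: it assumes the input contains no unescaped double quotes or raw control characters, since those would make the wrapped text invalid JSON and json_decode would return null.

```php
<?php
// The escaped string from the question (single quotes keep the backslashes literal).
$str = '\u041d\u0418\u041a\u041e\u041b\u0410\u0415\u0412';

// Wrap the text in double quotes so it becomes a valid JSON string literal.
$decoded = json_decode('"' . $str . '"');

var_dump($decoded);                             // string(16) "НИКОЛАЕВ"
var_dump(mb_check_encoding($decoded, 'UTF-8')); // bool(true)
```

Each of the eight Cyrillic letters takes two bytes in UTF-8, which is why the reported length is 16.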

mario

To parse that string in PHP, you can use json_decode, because JSON supports that Unicode escape format.

Alex Turpin
    `json_decode` returns a UTF-8 string; using `utf8_encode` again would not be helpful. – hakre Oct 25 '11 at 18:55

To preface, you generally should not encounter \uXXXX Unicode escape sequences outside of JSON documents, and in that case you should decode the whole document with json_decode() rather than trying to cherry-pick strings out of the middle by hand.
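
A rough sketch of decoding the whole document instead of individual strings. The JSON payload and its "id" and "surname" keys are made up for illustration; only json_decode(), json_last_error() and json_last_error_msg() are standard PHP.

```php
<?php
// Hypothetical JSON document; the keys are invented for this example.
$json = '{"id": 1, "surname": "\u041d\u0418\u041a\u041e\u041b\u0410\u0415\u0412"}';

// Decode the entire document; passing true returns associative arrays.
$data = json_decode($json, true);

// json_decode() returns null on malformed input, so check the error state.
if ($data === null && json_last_error() !== JSON_ERROR_NONE) {
    throw new RuntimeException('Invalid JSON: ' . json_last_error_msg());
}

echo $data['surname'], PHP_EOL; // НИКОЛАЕВ
```

Every string in the decoded structure is already UTF-8, so no per-field unescaping is needed.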

If you want to generate JSON documents without Unicode escape sequences, use the JSON_UNESCAPED_UNICODE flag in json_encode(). However, escaping is the default because pure-ASCII output is the most likely to pass safely through various intermediate systems. I would strongly recommend leaving escapes enabled unless you have a solid reason not to.
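
To illustrate the difference, here is a small sketch comparing the default output with JSON_UNESCAPED_UNICODE. The "surname" key is a placeholder, and the source file is assumed to be saved as UTF-8.

```php
<?php
// Placeholder data; the source file is assumed to be UTF-8 encoded.
$data = ['surname' => 'НИКОЛАЕВ'];

$escaped   = json_encode($data);                         // default: ASCII-only output
$unescaped = json_encode($data, JSON_UNESCAPED_UNICODE); // raw UTF-8 output

echo $escaped, PHP_EOL;   // {"surname":"\u041d\u0418\u041a\u041e\u041b\u0410\u0415\u0412"}
echo $unescaped, PHP_EOL; // {"surname":"НИКОЛАЕВ"}

// Both forms decode to the same data.
var_dump(json_decode($escaped, true) === json_decode($unescaped, true)); // bool(true)
```

Since both documents decode identically, the flag only affects how the bytes look in transit, not what the receiver ends up with.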

Lastly, if you're just looking for something to make Unicode text "safe" in some fashion, please read the following SO masterpost instead: UTF-8 all the way through

If, after three paragraphs of "don't do this", you still want to do this, then here are a couple of functions for applying/removing \uXXXX escapes in arbitrary text:

<?php

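// Replace every non-ASCII character in a UTF-8 string with \uXXXX escapes.
function utf8_escape($input) {
    $output = '';
    for( $i=0,$l=mb_strlen($input, 'UTF-8'); $i<$l; ++$i ) {
        $cur = mb_substr($input, $i, 1, 'UTF-8');
        if( strlen($cur) === 1 ) {
            // Single-byte characters are ASCII and pass through unchanged.
            $output .= $cur;
            continue;
        }
        $cp = mb_ord($cur, 'UTF-8');
        if( $cp > 0xFFFF ) {
            // Outside the BMP: emit a UTF-16 surrogate pair, as JSON does.
            $cp -= 0x10000;
            $output .= sprintf('\\u%04x\\u%04x', 0xD800 | ($cp >> 10), 0xDC00 | ($cp & 0x3FF));
        } else {
            $output .= sprintf('\\u%04x', $cp);
        }
    }
    return $output;
}

// Replace \uXXXX escapes (including surrogate pairs) with UTF-8 characters.
function utf8_unescape($input) {
    return preg_replace_callback(
        // First alternative: high + low surrogate pair. Second: any single escape.
        '/\\\\u([dD][89abAB][0-9a-fA-F]{2})\\\\u([dD][c-fC-F][0-9a-fA-F]{2})|\\\\u([0-9a-fA-F]{4})/',
        function($a) {
            if( isset($a[3]) ) {
                return mb_chr(hexdec($a[3]), 'UTF-8');
            }
            // Combine the surrogate pair into one code point above U+FFFF.
            $cp = 0x10000 + ((hexdec($a[1]) & 0x3FF) << 10) + (hexdec($a[2]) & 0x3FF);
            return mb_chr($cp, 'UTF-8');
        },
        $input
    );
}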

$u_input = 'hello world, 私のホバークラフトはうなぎで満たされています';
$e_input = 'hello world, \u79c1\u306e\u30db\u30d0\u30fc\u30af\u30e9\u30d5\u30c8\u306f\u3046\u306a\u304e\u3067\u6e80\u305f\u3055\u308c\u3066\u3044\u307e\u3059';

var_dump(
    utf8_escape($u_input),
    utf8_unescape($e_input)
);

Output:

string(145) "hello world, \u79c1\u306e\u30db\u30d0\u30fc\u30af\u30e9\u30d5\u30c8\u306f\u3046\u306a\u304e\u3067\u6e80\u305f\u3055\u308c\u3066\u3044\u307e\u3059"
string(79) "hello world, 私のホバークラフトはうなぎで満たされています"
Sammitch