I have strings like this:
\u041d\u0418\u041a\u041e\u041b\u0410\u0415\u0412
How can I convert this to UTF-8? And what is the encoding of the given string? Thank you for participating!
The simple approach is to wrap your string in double quotes and let json_decode convert the \uXXXX escapes (which happen to be JavaScript string syntax).
$str = json_decode("\"$str\"");
The result is Russian (Cyrillic) text: НИКОЛАЕВ. It is already UTF-8 when json_decode returns it.
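One caveat with the quote-wrapping trick: json_decode returns null when the input contains a raw double quote, a stray backslash, or a control character. A minimal sketch of a more defensive variant, which decodes only the runs of \uXXXX escapes and leaves everything else untouched, might look like this. The function name decode_unicode_escapes is made up for illustration, and the sketch assumes the mbstring extension is available for the encoding checks.

```php
<?php
// Hypothetical helper (the name is not from the original answers).
// Decodes runs of \uXXXX escapes and leaves all other text untouched.
function decode_unicode_escapes(string $input): string {
    return preg_replace_callback(
        // One or more consecutive escapes, so that UTF-16 surrogate
        // pairs are handed to json_decode together.
        '/(?:\\\\u[0-9a-fA-F]{4})+/',
        function (array $m): string {
            $decoded = json_decode('"' . $m[0] . '"');
            // Keep the original text if json_decode rejects it
            // (for example, a lone surrogate).
            return is_string($decoded) ? $decoded : $m[0];
        },
        $input
    ) ?? $input;
}

$str = '\u041d\u0418\u041a\u041e\u041b\u0410\u0415\u0412';

// The escaped form itself is plain ASCII; the escapes name Unicode code points.
var_dump(mb_check_encoding($str, 'ASCII'));      // bool(true)

$decoded = decode_unicode_escapes($str);
var_dump($decoded);                              // string(16) "НИКОЛАЕВ"
var_dump(mb_check_encoding($decoded, 'UTF-8'));  // bool(true)
```

This also answers the encoding question: the original string is ASCII text that spells out Unicode code points, and the decoded result is UTF-8.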
To parse that string in PHP, you can use json_decode, because JSON supports that Unicode escape format.
To preface, you generally should not encounter \uXXXX Unicode escape sequences outside of JSON documents, in which case you should decode those documents with json_decode() rather than trying to cherry-pick strings out of the middle by hand.
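As a rough illustration of decoding the whole document rather than an individual string, assuming the escaped text arrived inside a JSON object (the "surname" key is invented for this example):

```php
<?php
// Illustrative JSON document; the "surname" key is an assumption.
$json = '{"surname":"\u041d\u0418\u041a\u041e\u041b\u0410\u0415\u0412"}';

// Decode the whole document; the escapes become UTF-8 along the way.
$data = json_decode($json, true);
if (json_last_error() !== JSON_ERROR_NONE) {
    throw new RuntimeException('Invalid JSON: ' . json_last_error_msg());
}

echo $data['surname'], "\n";   // НИКОЛАЕВ
```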
If you want to generate JSON documents without Unicode escape sequences, use the JSON_UNESCAPED_UNICODE flag in json_encode(). However, escaping is the default because escaped output is the most likely to be transmitted safely through various intermediate systems. I would strongly recommend leaving escapes enabled unless you have a solid reason not to.
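A short sketch of the difference the flag makes, using an invented one-key array as sample data:

```php
<?php
// Sample data; the array contents are only for illustration.
$data = ['surname' => 'НИКОЛАЕВ'];

// Default: non-ASCII characters are written as \uXXXX escapes.
$escaped = json_encode($data);
echo $escaped, "\n";
// {"surname":"\u041d\u0418\u041a\u041e\u041b\u0410\u0415\u0412"}

// With the flag: the characters are emitted as literal UTF-8.
$literal = json_encode($data, JSON_UNESCAPED_UNICODE);
echo $literal, "\n";
// {"surname":"НИКОЛАЕВ"}

// Both documents decode to the same data.
var_dump(json_decode($escaped, true) === json_decode($literal, true)); // bool(true)
```

Either form is valid JSON; the choice only affects how the bytes look in transit.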
Lastly, if you're just looking for a way to make Unicode text "safe" in some fashion, please read the following Stack Overflow masterpost instead: UTF-8 all the way through
If, after three paragraphs of "don't do this", you still want to do this, then here are a couple of functions for applying/removing \uXXXX escapes in arbitrary text:
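<?php
// Requires the mbstring extension; mb_ord() and mb_chr() need PHP 7.2+.

// Replace every non-ASCII character with a \uXXXX escape.
// Note: code points above U+FFFF need more than four hex digits here,
// so they will not round-trip through utf8_unescape().
function utf8_escape($input) {
    $output = '';
    for ($i = 0, $l = mb_strlen($input, 'UTF-8'); $i < $l; ++$i) {
        $cur = mb_substr($input, $i, 1, 'UTF-8');
        if (strlen($cur) === 1) {
            // A single byte means plain ASCII: copy it unchanged.
            $output .= $cur;
        } else {
            $output .= sprintf('\\u%04x', mb_ord($cur, 'UTF-8'));
        }
    }
    return $output;
}

// Replace every \uXXXX escape with the UTF-8 character it names.
function utf8_unescape($input) {
    return preg_replace_callback(
        '/\\\\u([0-9a-fA-F]{4})/',
        function ($a) {
            return mb_chr(hexdec($a[1]), 'UTF-8');
        },
        $input
    );
}

$u_input = 'hello world, 私のホバークラフトはうなぎで満たされています';
$e_input = 'hello world, \u79c1\u306e\u30db\u30d0\u30fc\u30af\u30e9\u30d5\u30c8\u306f\u3046\u306a\u304e\u3067\u6e80\u305f\u3055\u308c\u3066\u3044\u307e\u3059';

var_dump(
    utf8_escape($u_input),
    utf8_unescape($e_input)
);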
Output:
string(145) "hello world, \u79c1\u306e\u30db\u30d0\u30fc\u30af\u30e9\u30d5\u30c8\u306f\u3046\u306a\u304e\u3067\u6e80\u305f\u3055\u308c\u3066\u3044\u307e\u3059"
string(79) "hello world, 私のホバークラフトはうなぎで満たされています"
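One limitation of utf8_escape() as written: for code points beyond U+FFFF (emoji, for instance), mb_ord() returns a value that needs five or six hex digits, so the result is no longer a four-digit escape and utf8_unescape() will not restore it. JSON represents such characters as a UTF-16 surrogate pair. A possible variant along those lines is sketched below; the name utf8_escape_utf16 is hypothetical and, like the original, it assumes valid UTF-8 input and the mbstring extension.

```php
<?php
// Hypothetical variant of utf8_escape() that always emits four hex digits.
function utf8_escape_utf16(string $input): string {
    return preg_replace_callback(
        // Any single non-ASCII character (the /u flag enables UTF-8 matching).
        '/[^\x00-\x7F]/u',
        function (array $m): string {
            $cp = mb_ord($m[0], 'UTF-8');
            if ($cp <= 0xFFFF) {
                return sprintf('\\u%04x', $cp);
            }
            // Split code points above U+FFFF into a UTF-16 surrogate pair.
            $cp -= 0x10000;
            return sprintf(
                '\\u%04x\\u%04x',
                0xD800 | ($cp >> 10),
                0xDC00 | ($cp & 0x3FF)
            );
        },
        $input
    ) ?? $input; // preg_replace_callback returns null on invalid UTF-8
}

$original = "ok \u{1F600}";
$escaped  = utf8_escape_utf16($original);
echo $escaped, "\n";   // ok \ud83d\ude00

// json_decode understands surrogate pairs, so the text round-trips.
var_dump(json_decode('"' . $escaped . '"') === $original);   // bool(true)
```

The round-trip check uses the quote-wrapping trick from earlier in the thread, so it only holds for text without raw double quotes or backslashes.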