Converting these types of unicode to UTF8 in PHP

Question

I am trying to convert this into readable UTF-8 text in PHP:

Tel Aviv-Yafo (Hebrew: \u05ea\u05b5\u05bc\u05dc\u05be\u05d0\u05b8\u05d1\u05b4\u05d9\u05d1-\u05d9\u05b8\u05e4\u05d5\u05b9; Arabic: \u062a\u0644 \u0623\u0628\u064a\u0628\u200e, Tall \u02bcAb\u012bb), usually called Tel Aviv

Any ideas on how to do so?

I tried several methods I found online, but none of them worked.

In this case the escaped Unicode text is in Hebrew and Arabic.

Duplicate: http://stackoverflow.com/questions/2934563/how-to-decode-unicode-escape-sequences-like-u00ed-to-proper-utf-8-encoded-cha — Samuel Katz, Oct 26 '11 at 00:07

score 8 · Answer 1 · answered Sep 25 '11 at 14:48

None of the other answers works perfectly as is. Combining them and adding my own fix gives this:

$replacedString = preg_replace("/\\\\u([0-9a-fA-F]{4})/", "&#x$1;", $originalString);
$unicodeString = mb_convert_encoding($replacedString, 'UTF-8', 'HTML-ENTITIES');

This one definitely does work :)

answered Sep 25 '11 at 14:48

dzeikei

I must mention that mb_convert_encoding() will also convert any &quot; in the original string into " because it involves parsing HTML entities. Beware! – dzeikei Oct 02 '11 at 10:34
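
One way to avoid the side effect described in the comment above is to skip HTML entities entirely and convert each escape straight to a character. The following is only a sketch, not anyone's answer from this thread: the function name decodeUnicodeEscapes is made up for illustration, and it assumes PHP 7.2+ with the mbstring extension (for mb_chr). It leaves surrogate pairs such as \ud83d\ude00 untouched.

```php
<?php
// Hypothetical helper (not from the thread): decodes \uXXXX escapes directly,
// so existing HTML entities such as &quot; are left alone.
function decodeUnicodeEscapes(string $input): string
{
    $result = preg_replace_callback(
        '/\\\\u([0-9a-fA-F]{4})/',
        static function (array $m): string {
            // hexdec() turns "05ea" into 0x05EA; mb_chr() encodes it as UTF-8.
            $char = mb_chr(hexdec($m[1]), 'UTF-8');
            // If mb_chr() rejects the code point, keep the escape unchanged.
            return $char === false ? $m[0] : $char;
        },
        $input
    );

    // preg_replace_callback() returns null on error; fall back to the input.
    return $result ?? $input;
}

$text = 'He said &quot;\u05ea\u05dc&quot;';
echo decodeUnicodeEscapes($text), PHP_EOL;
```

The &quot; entities in $text come through unchanged, while the two escapes become Hebrew letters.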

score 3 · Answer 2 · answered Dec 12 '11 at 10:49

I encountered the same problem recently, so I was glad to see this question. After some tests, I found that the following code works:

$replacedString = preg_replace("/\\\\u([0-9abcdef]{4})/", "&#x$1;", $original_string);
//$unicodeString    = mb_convert_encoding($replacedString, 'UTF-8', 'HTML-ENTITIES');

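The only thing I changed is that I commented out the second line of code. The result then contains HTML numeric character references (such as &#x05ea;) rather than UTF-8 characters, so it only reads correctly when a browser renders it as HTML. The web page, however, must be set to display UTF-8.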

Enjoy!
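
To make the difference between the two lines concrete, here is a small sketch of what the first line produces on its own and one way to turn that into a real UTF-8 string. Using html_entity_decode() here is my assumption, not something proposed in this answer. It has the same caveat as mb_convert_encoding(): other entities already present in the input get decoded too. As far as I know, the 'HTML-ENTITIES' encoding in mbstring is deprecated as of PHP 8.2, which is one reason to consider it.

```php
<?php
// Single quotes keep \u062a as literal backslash-u text.
$original = 'Arabic: \u062a\u0644';

// Step 1: rewrite each \uXXXX escape as a hexadecimal HTML entity.
$entities = preg_replace('/\\\\u([0-9a-fA-F]{4})/', '&#x$1;', $original);
echo $entities, PHP_EOL; // Arabic: &#x062a;&#x0644;

// Step 2: decode the entities into UTF-8 characters.
$utf8 = html_entity_decode($entities, ENT_QUOTES | ENT_HTML5, 'UTF-8');
echo $utf8, PHP_EOL;
```

After step 1 the string is still plain ASCII. Only step 2 (or a browser rendering the HTML) yields the actual Arabic letters.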

score 2 · Answer 3 · answered Dec 04 '10 at 20:12

It doesn't always work, because a \uXXXX code can contain letters as well as digits. Try replacing \d (digits only) with \w (which matches letters, digits, and underscores).

function unicode_conv($originalString) {
  // The four \\\\ in the pattern here are necessary to match \u in the original string
  // The x after &# marks the entity as hexadecimal, which \uXXXX codes are
  $replacedString = preg_replace("/\\\\u(\w{4})/", "&#x$1;", $originalString);
  $unicodeString = mb_convert_encoding($replacedString, 'UTF-8', 'HTML-ENTITIES');
  return $unicodeString;
}

score 1 · Answer 4 · answered Jan 11 '10 at 21:33

See this comment for a way to get a Unicode character from its numerical code. Then, you could write a regex replace that will replace each \uXXXX pattern with the equivalent character.

Alternatively, you could replace each \uXXXX pattern with its matching &#XXXX; HTML entity form, and then use the following:

$unicodeString = mb_convert_encoding($stringWithHtmlEntities, 'UTF-8', 'HTML-ENTITIES');

More complete example:

// The four \\\\ in the pattern here are necessary to match \u in the original string
$replacedString = preg_replace("/\\\\u(\d{4})/", "&#$1;", $originalString);
$unicodeString = mb_convert_encoding($replacedString, 'UTF-8', 'HTML-ENTITIES');

edited Jan 12 '10 at 02:59

answered Jan 11 '10 at 21:33

Amber

Could you give me an example? I didn't understand the example in the link. Say I have this string "\u05ea" somewhere in the text: how would I change it to its HTML entity form (as it's not "ea;") or apply the first option you mentioned? Thanks for the help. – Simon Jan 11 '10 at 21:47
Sure, I added a more complete example to my answer. – Amber Jan 12 '10 at 03:00
@Dav: Why `\\\\u`? Isn't `\\u` enough? I also think that `\d{2,4}` would make it more complete. – Alix Axel Jan 12 '10 at 03:05
Alix: `\u` would be interpreted by the regex engine as an escape-code u, sort of like how `\d` is the set of digits, and `\w` is the set of "word" characters. Thus you need to actually escape the slash in the *regex*, which means your regex needs to be `\\u`, and then you have to escape those slashes since they're within the string, thus you have \\\\ as the escaped form of \\. – Amber Jan 16 '10 at 07:10
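
The two layers of escaping described in the comment above can be checked directly. This is just an illustrative sketch, and the variable names are mine. It prints the pattern string that the regex engine actually receives, and shows that the single-quoted and double-quoted spellings produce the same string.

```php
<?php
// PHP string escaping halves the backslashes: four in source, two in the string.
$doubleQuoted = "/\\\\u([0-9a-fA-F]{4})/";
$singleQuoted = '/\\\\u([0-9a-fA-F]{4})/';

var_dump($doubleQuoted === $singleQuoted); // bool(true)
echo $doubleQuoted, PHP_EOL;               // /\\u([0-9a-fA-F]{4})/

// The regex engine then reads \\ as one literal backslash followed by "u".
$subject = 'x\u05eay'; // single quotes: a literal backslash, then u05ea
var_dump(preg_match($doubleQuoted, $subject, $m)); // int(1)
echo $m[1], PHP_EOL;                               // 05ea
```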

score 1 · Answer 5 · answered Dec 02 '10 at 11:30

You should add 'x' after '#' in replacement string to indicate that hexadecimal numbers are used.

$replacedString = preg_replace("/\\\\u(\d{4})/", "&#x$1;", $originalString);
$unicodeString = mb_convert_encoding($replacedString, 'UTF-8', 'HTML-ENTITIES');

score 0 · Answer 6 · answered Jun 30 '23 at 12:53

There is a very simple and beautiful solution.

If we want to decode Unicode escape sequences like "\u05dc" to "ל", we can simply use the json_decode function:

$a="Tel Aviv-Yafo (Hebrew: \u05ea\u05b5\u05bc\u05dc\u05be\u05d0\u05b8\u05d1\u05b4\u05d9\u05d1-\u05d9\u05b8\u05e4\u05d5\u05b9; Arabic: \u062a\u0644 \u0623\u0628\u064a\u0628\u200e, Tall \u02bcAb\u012bb), usually called Tel Aviv";

echo json_decode("\"$a\"");

output:

Tel Aviv-Yafo (Hebrew: תֵּל־אָבִיב-יָפוֹ; Arabic: تل أبيب‎, Tall ʼAbīb), usually called Tel Aviv

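It works because \uXXXX is JSON's own escape syntax: json_decode turns such sequences into UTF-8, and json_encode by default does the reverse, escaping every non-ASCII character as a \uXXXX sequence: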

echo json_encode("תֵּל");
# output: "\u05ea\u05b5\u05bc\u05dc"
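
One caveat with wrapping the whole string in quotes: if the text itself contains a double quote or a raw newline, it is no longer valid JSON and json_decode returns null. A possible workaround, sketched below under my own assumptions (the helper name decodeEscapesViaJson is hypothetical), is to hand json_decode only the runs of \uXXXX escapes. Matching whole runs keeps surrogate pairs together, so characters outside the Basic Multilingual Plane decode as well.

```php
<?php
// Hypothetical helper: decode only runs of \uXXXX escapes via json_decode().
function decodeEscapesViaJson(string $input): string
{
    $result = preg_replace_callback(
        '/(?:\\\\u[0-9a-fA-F]{4})+/',
        static function (array $m): string {
            // Each run of escapes is a valid JSON string body on its own.
            $decoded = json_decode('"' . $m[0] . '"');
            // json_decode() returns null on invalid input; keep the text then.
            return is_string($decoded) ? $decoded : $m[0];
        },
        $input
    );

    return $result ?? $input;
}

$text = 'Say "\u05ea\u05dc" \ud83d\ude00';

// Wrapping the whole text fails because of the embedded double quotes.
var_dump(json_decode('"' . $text . '"')); // NULL

echo decodeEscapesViaJson($text), PHP_EOL;
```

The quotes around the Hebrew word survive, and the surrogate pair at the end becomes a single emoji character.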
