4

I am trying to convert this into readable UTF-8 text in PHP:

Tel Aviv-Yafo (Hebrew: \u05ea\u05b5\u05bc\u05dc\u05be\u05d0\u05b8\u05d1\u05b4\u05d9\u05d1-\u05d9\u05b8\u05e4\u05d5\u05b9; Arabic: \u062a\u0644 \u0623\u0628\u064a\u0628\u200e, Tall \u02bcAb\u012bb), usually called Tel Aviv

Any ideas on how to do so?

I tried several methods I found online, but couldn't find one that works.

In this case the escaped Unicode is Hebrew and Arabic text.

Simon
  • Duplicate: http://stackoverflow.com/questions/2934563/how-to-decode-unicode-escape-sequences-like-u00ed-to-proper-utf-8-encoded-cha – Samuel Katz Oct 26 '11 at 00:07

6 Answers

8

None of the other answers works perfectly as is. Combining them with my own addition gives this:

// Accept both lowercase and uppercase hex digits after \u
$replacedString = preg_replace("/\\\\u([0-9a-fA-F]{4})/", "&#x$1;", $originalString);
$unicodeString = mb_convert_encoding($replacedString, 'UTF-8', 'HTML-ENTITIES');

This one definitely does work :)

dzeikei
  • I must mention that using the mb_convert_encoding() method will convert any &quot; in the original string into " because it involves parsing HTML entities. Beware! – dzeikei Oct 02 '11 at 10:34
3

I encountered the same problem recently, so was glad to see this question. Doing some tests, I found the following code works:

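$replacedString = preg_replace("/\\\\u([0-9a-fA-F]{4})/", "&#x$1;", $original_string);
//$unicodeString    = mb_convert_encoding($replacedString, 'UTF-8', 'HTML-ENTITIES');

The only thing I changed is that I commented out the second line of code. The result therefore still contains HTML numeric character references (such as &#x05ea;) rather than UTF-8 characters, so it only reads correctly when rendered by a browser. The web page, however, must be set to display UTF-8.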

Enjoy!

Yaron Cohen
2

It doesn't always work, because a \uXXXX code can contain both digits and letters. Try replacing \d (digits only) with \w (which matches letters, digits, and the underscore).

function unicode_conv($originalString) {
  // The four \\\\ in the pattern here are necessary to match \u in the original string
  // The x in &#x marks the captured digits as hexadecimal
  $replacedString = preg_replace("/\\\\u(\w{4})/", "&#x$1;", $originalString);
  $unicodeString = mb_convert_encoding($replacedString, 'UTF-8', 'HTML-ENTITIES');
  return $unicodeString;
}
mykhi
1

See this comment for a way to get a unicode character from its numerical code. Then, you could write a regex replace that will replace each \uXXXX pattern with the equivalent character.
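The regex-replace idea can be sketched roughly as follows. This is only an illustration, not the code from the linked comment: the function name decode_unicode_escapes is made up for the example, and it assumes PHP 7.2 or later with the mbstring extension so that mb_chr() is available. It does not combine UTF-16 surrogate pairs (two consecutive escapes in the \ud800 to \udfff range); any escape that mb_chr() rejects is left as it was.

```php
<?php
// Hypothetical helper name; assumes PHP >= 7.2 with mbstring (for mb_chr).
function decode_unicode_escapes(string $text): string
{
    // In this single-quoted literal, \\\\ is two backslashes, which the
    // regex engine reads as one literal backslash followed by "u".
    $result = preg_replace_callback(
        '/\\\\u([0-9a-fA-F]{4})/',
        static function (array $m): string {
            // hexdec() turns "05ea" into 1514; mb_chr() encodes that code point as UTF-8.
            $char = mb_chr((int) hexdec($m[1]), 'UTF-8');
            // Keep the original escape if the code point cannot be encoded.
            return $char === false ? $m[0] : $char;
        },
        $text
    );

    // preg_replace_callback() returns null on a regex error; fall back to the input.
    return $result ?? $text;
}

$input = 'Hebrew: \u05ea\u05dc, Tall \u02bcAb\u012bb, and a "quoted" word';
echo decode_unicode_escapes($input), "\n";
```

Because no HTML entity decoding is involved, quotes and existing entities in the input pass through unchanged.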

Alternatively, you could replace each \uXXXX pattern with its matching &#XXXX; html entity form, and then use the following:

mb_convert_encoding(string_with_html_entities, 'UTF-8', 'HTML-ENTITIES');

More complete example:

// The four \\\\ in the pattern here are necessary to match \u in the original string
$replacedString = preg_replace("/\\\\u(\d{4})/", "&#$1;", $originalString);
$unicodeString = mb_convert_encoding($replacedString, 'UTF-8', 'HTML-ENTITIES');
Amber
  • Could you give me an example? I didn't understand the example in the link. Say I have this string "\u05ea" somewhere in the text - how would I change it to its HTML entity form, as it's not "&#05ea;", or use the first option you mentioned? Thanks for the help. – Simon Jan 11 '10 at 21:47
  • Sure, I added a more complete example to my answer. – Amber Jan 12 '10 at 03:00
  • @Dav: Why `\\\\u`? Isn't `\\u` enough? I also think that `\d{2,4}` would make it more complete. – Alix Axel Jan 12 '10 at 03:05
  • Alix: `\u` would be interpreted by the regex engine as an escape-code u, sort of like how `\d` is the set of digits, and `\w` is the set of "word" characters. Thus you need to actually escape the slash in the *regex*, which means your regex needs to be `\\u`, and then you have to escape those slashes since they're within the string, thus you have \\\\ as the escaped form of \\. – Amber Jan 16 '10 at 07:10
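The two levels of escaping described in that comment can be checked with a short script. This is a minimal sketch; the sample subject string is made up for the illustration.

```php
<?php
// PHP collapses each \\ in the double-quoted literal into one backslash,
// so the regex engine receives /\\u([0-9a-fA-F]{4})/ and reads its \\ as
// a single literal backslash in front of "u".
$pattern = "/\\\\u([0-9a-fA-F]{4})/";

// Shows the pattern as the regex engine sees it.
echo $pattern, "\n";

// Single quotes keep \u05ea as a literal backslash, "u" and four hex digits.
$subject = 'code \u05ea here';
$matched = preg_match($pattern, $subject, $m);

var_dump($matched); // 1 when the escape was found
echo $m[1], "\n";   // the captured hex digits
```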
1

You should add 'x' after '#' in the replacement string to indicate that the captured digits are hexadecimal.

$replacedString = preg_replace("/\\\\u([0-9a-fA-F]{4})/", "&#x$1;", $originalString);
$unicodeString = mb_convert_encoding($replacedString, 'UTF-8', 'HTML-ENTITIES');
petr
0

There is a very simple and beautiful solution.

If we want to decode Unicode escape sequences like "\u05bc\u05dc" to "ל", we can simply use the json_decode function:

$a="Tel Aviv-Yafo (Hebrew: \u05ea\u05b5\u05bc\u05dc\u05be\u05d0\u05b8\u05d1\u05b4\u05d9\u05d1-\u05d9\u05b8\u05e4\u05d5\u05b9; Arabic: \u062a\u0644 \u0623\u0628\u064a\u0628\u200e, Tall \u02bcAb\u012bb), usually called Tel Aviv";

echo json_decode("\"$a\"");

output:

Tel Aviv-Yafo (Hebrew: תֵּל־אָבִיב-יָפוֹ; Arabic: تل أبيب‎, Tall ʼAbīb), usually called Tel Aviv


It works because \uXXXX is the escape syntax of JSON strings: by default json_encode escapes every non-ASCII character as a \uXXXX sequence, and json_decode reverses that:

echo json_encode("תֵּל");
# output: "\u05ea\u05b5\u05bc\u05dc"
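One caveat with wrapping the text in quotes is that the result must itself be a valid JSON string. The sketch below is a hedged variation, not part of the original answer: the function name decode_via_json and the null-on-failure convention are assumptions made for the example. It escapes bare double quotes first and assumes the input does not already contain backslash-escaped quotes. Raw control characters such as newlines are not valid inside a JSON string, so those inputs come back as null.

```php
<?php
// Hypothetical wrapper; the name and the null-on-failure convention are assumptions.
function decode_via_json(string $text): ?string
{
    // Escape bare double quotes so the wrapped text stays a valid JSON string literal.
    $json = '"' . str_replace('"', '\\"', $text) . '"';

    // json_decode() returns null when the input is not valid JSON.
    $decoded = json_decode($json);

    return is_string($decoded) ? $decoded : null;
}

var_dump(decode_via_json('\u05ea\u05dc "quoted"'));
var_dump(decode_via_json("line one\nline two")); // raw newline: not a valid JSON string
```

Unlike a plain four-hex-digit regex replacement, json_decode also combines surrogate pairs such as \ud83d\ude00 into a single character.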
Sergey Yurich