13

I have a Unicode text-block, like this:

ụ
ư
ứ
Ỳ
Ỷ
Ỵ
Đ

Now I want to convert this original Unicode text block into a block of hexadecimal UTF-8 byte sequences (see the "Hexadecimal UTF-8" column on this page: https://en.wikipedia.org/wiki/UTF-8) using PHP, like this:

\xe1\xbb\xa5
\xc6\xb0
\xe1\xbb\xa9
\xe1\xbb\xb2
\xe1\xbb\xb6
\xe1\xbb\xb4
\xc4\x90

Not the Unicode code points, like this:

0x1EE5
0x01B0
0x1EE9
0x1EF2
0x1EF6
0x1EF4
0x0110

Is there a way to do this in PHP?


I have read this topic (PHP: Convert unicode codepoint to UTF-8), but it does not seem to cover my question.


I am sorry, I don't know much about Unicode.

Community
  • 1
  • 1
  • 1
    You have to know (or try to guess, but that only works some of the time) what encoding your input is in. If it's already in UTF-8 then it's probably already in the format you want -- assuming that by `0xe1` you don't mean the 4 bytes representing `0`, `x`, `e`, `1` but rather one byte representing the number 225. – Jon Jul 19 '15 at 13:48
  • 2
    The [second answer on the question you link to](http://stackoverflow.com/a/7153133/266143) _does_ convert a Unicode code point to UTF-8 bytes. – CodeCaster Jul 19 '15 at 13:49
  • Can you show what you have tried? So that we could know exactly what you are trying to do. Currently, there are many ways to interpret your question, as we are trying to guess your purpose in doing such conversion. – nhahtdh Jul 20 '15 at 05:04

3 Answers

13

I think you're looking for the bin2hex() function:

Convert binary data into hexadecimal representation

Then format the result by prepending \x to each two-digit hex byte (00-ff):

function str_hex_format ($bin) {
  return '\x'.implode('\x', str_split(bin2hex($bin), 2));
}

For your sample:

// utf8 encoded input
$arr = ["ụ","ư","ứ","Ỳ","Ỷ","Ỵ","Đ"];

foreach($arr AS $v)
  echo $v . " => " . str_hex_format($v) . "\n";

See test at eval.in (link expires)

ụ => \xe1\xbb\xa5
ư => \xc6\xb0
ứ => \xe1\xbb\xa9
Ỳ => \xe1\xbb\xb2
Ỷ => \xe1\xbb\xb6
Ỵ => \xe1\xbb\xb4
Đ => \xc4\x90

Decode example:

$str = str_hex_format("ụưứỲỶỴĐ");
echo $str;

\xe1\xbb\xa5\xc6\xb0\xe1\xbb\xa9\xe1\xbb\xb2\xe1\xbb\xb6\xe1\xbb\xb4\xc4\x90

echo hex2bin(str_replace('\x', "", $str));

ụưứỲỶỴĐ


For more information about the \x escape sequence in double-quoted strings, see the PHP manual.
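To make the quoting point concrete, here is a small sketch (variable names are just for illustration) showing that '\xe1' in single quotes is four literal characters, while "\xe1" in double quotes is a single byte. It assumes the file itself is saved as UTF-8.

```php
<?php
// Single quotes: the backslash and "x" are kept literally, giving 4 ASCII characters.
$literal = '\xe1';

// Double quotes: PHP interprets \xe1 as the single byte 0xE1 (decimal 225).
$byte = "\xe1";

echo strlen($literal), "\n"; // 4
echo strlen($byte), "\n";    // 1
echo ord($byte), "\n";       // 225

// Three escaped bytes in double quotes form the UTF-8 encoding of "ụ"
// (this comparison assumes the source file is saved as UTF-8).
$sameBytes = ("\xe1\xbb\xa5" === "ụ");
var_dump($sameBytes);
```

This is why the decode step above strips the literal '\x' markers and passes the remaining hex digits to hex2bin(), rather than relying on PHP to interpret the escapes at runtime.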

Jonny 5
  • +1. That's exactly how I do it for codepoints.net: https://github.com/Codepoints/Codepoints.net/blob/19184d5cf40f9d335487db9ad58318af2ba0149c/codepoints.net/lib/codepoint.class.php#L99-L104 – Boldewyn Jul 22 '15 at 07:40
3

PHP treats strings as arrays of bytes, regardless of encoding. If you don't need to delimit the UTF-8 characters, then something like this works:

$str='ụưứỲỶỴĐ';
foreach(str_split($str) as $char)
  echo '\x'.str_pad(dechex(ord($char)),2,'0',STR_PAD_LEFT); // str_pad(string, length, pad_string, pad_type)

Output:

\xe1\xbb\xa5\xc6\xb0\xe1\xbb\xa9\xe1\xbb\xb2\xe1\xbb\xb6\xe1\xbb\xb4\xc4\x90

If you need to delimit the UTF-8 characters (e.g. with a newline), then you'll need something like this:

$str='ụưứỲỶỴĐ';
foreach(array_slice(preg_split('~~u',$str),1,-1) as $UTF8char){ // split before/after every UTF8 character and remove first/last empty string
  foreach(str_split($UTF8char) as $char)
    echo '\x'.str_pad(dechex(ord($char)),2,'0',STR_PAD_LEFT); // str_pad(string, length, pad_string, pad_type)
  echo "\n"; // delimiter
}

Output:

\xe1\xbb\xa5
\xc6\xb0
\xe1\xbb\xa9
\xe1\xbb\xb2
\xe1\xbb\xb6
\xe1\xbb\xb4
\xc4\x90

This splits the string into UTF-8 characters using preg_split with the u modifier. Since preg_split returns an empty string before the first character and another after the last character, we use array_slice to drop the first and last elements. This can easily be modified to return an array, for example.
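A variant that returns an array could look like the following sketch. The function name utf8_hex_per_char is hypothetical, and it swaps in two alternatives to the code above: the PREG_SPLIT_NO_EMPTY flag instead of array_slice, and sprintf('%02x') instead of str_pad. It assumes the input is valid UTF-8.

```php
<?php
/**
 * Hypothetical helper: returns one '\xNN...' string per UTF-8 character.
 * Assumes $str is valid UTF-8; preg_split() returns false otherwise.
 */
function utf8_hex_per_char(string $str): array
{
    // PREG_SPLIT_NO_EMPTY drops the empty leading and trailing pieces,
    // so no array_slice() is needed.
    $chars = preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }

    $result = [];
    foreach ($chars as $char) {
        $escaped = '';
        // str_split() without a length walks the character byte by byte.
        foreach (str_split($char) as $byte) {
            // %02x zero-pads each byte to two lowercase hex digits.
            $escaped .= sprintf('\x%02x', ord($byte));
        }
        $result[] = $escaped;
    }
    return $result;
}

// One array element per character of the input.
print_r(utf8_hex_per_char('ụưĐ'));
```

Joining the result with implode("\n", ...) gives the newline-delimited output shown above.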

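Edit: If JSON-style \u escapes are acceptable instead of \x, json_encode can produce one escape per byte (\u00e1\u00bb\u00a5...). Note that this does not give the \x format asked for, and that utf8_encode is deprecated as of PHP 8.2:

echo trim(json_encode(utf8_encode('ụưứỲỶỴĐ')),'"');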
Leo Jiang
1

The main thing you need to do is tell PHP how to interpret the incoming Unicode characters correctly. Once you do that, you can convert them to UTF-8 and then to hex as needed.

This code fragment takes your example characters as 16-bit Unicode (UCS-2, big-endian), converts them to UTF-8, and then dumps the hex representation of each character.

<?php
// "ụưứỲỶỴĐ" as big-endian UCS-2 bytes (two bytes per code point)
$unistr = "\x1E\xE5\x01\xB0\x1E\xE9\x1E\xF2\x1E\xF6\x1E\xF4\x01\x10";
echo "length=" . mb_strlen($unistr, 'UCS-2BE') . "\n";

// Here's the key statement: convert from 16-bit Unicode (UCS-2BE) to UTF-8
$utf8str = mb_convert_encoding($unistr, "UTF-8", 'UCS-2BE');
echo $utf8str . "\n";

for($i=0; $i < mb_strlen($utf8str, 'UTF-8'); $i++) {
    $c = mb_substr($utf8str, $i, 1, 'UTF-8');
    $hex = bin2hex($c);
    echo $c . "\t" . $hex . "\t" . preg_replace("/([0-9a-f]{2})/", '\\\\x\\1', $hex) . "\n";
}

?>

Produces

length=7
ụưứỲỶỴĐ
ụ   e1bba5  \xe1\xbb\xa5
ư   c6b0    \xc6\xb0
ứ   e1bba9  \xe1\xbb\xa9
Ỳ   e1bbb2  \xe1\xbb\xb2
Ỷ   e1bbb6  \xe1\xbb\xb6
Ỵ   e1bbb4  \xe1\xbb\xb4
Đ   c490    \xc4\x90
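
If the input is a list of code points (like the 0x1EE5 values in the question) rather than raw UCS-2 bytes, a shorter route might be mb_chr(), available since PHP 7.2 with the mbstring extension. This is only a sketch; the function name codepoint_to_utf8_hex is made up for illustration.

```php
<?php
// Code points taken from the question's second list.
$codepoints = [0x1EE5, 0x01B0, 0x1EE9, 0x1EF2, 0x1EF6, 0x1EF4, 0x0110];

/**
 * Hypothetical helper: encodes one code point as UTF-8 and
 * formats every resulting byte as \xNN.
 */
function codepoint_to_utf8_hex(int $cp): string
{
    // mb_chr() returns the character in the given encoding, or false on failure.
    $char = mb_chr($cp, 'UTF-8');
    if ($char === false) {
        throw new InvalidArgumentException('Invalid code point');
    }
    // bin2hex() gives e.g. "e1bba5"; split it into byte pairs and prefix each with \x.
    return '\x' . implode('\x', str_split(bin2hex($char), 2));
}

foreach ($codepoints as $cp) {
    printf("U+%04X => %s\n", $cp, codepoint_to_utf8_hex($cp));
}
```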
schtever