
I am working on my own project that requires converting Chinese characters to their Unicode code points.

Currently, I am using the code below with no problems:

base_convert(bin2hex(iconv("utf-8", "ucs-4", '人')), 16, 16) // Returns 4eba

However, when I tried adding a form to convert a character that the user inputs, the result was completely different:

base_convert(bin2hex(iconv("utf-8", "ucs-4", $_POST["char"])), 16, 16) // Returns 2600000023000000000000000000000000000000000000000000000000

Thanks in advance!

  • Beware that `base_convert()` is possibly alright for individual characters but it's totally unsuitable for general strings because it works with actual numbers and you'll get in trouble as soon as you get an integer larger than `PHP_INT_MAX`. – Álvaro González Dec 30 '14 at 15:32

1 Answer


If you want to get UTF-8 in the $_POST array you need to tell the browser that the form is to be submitted in UTF-8.

Generally the way to achieve this is to serve the page containing the form with an indicator that the page is encoded as UTF-8. Otherwise, the browser will arbitrarily guess which encoding is in use, and that guess probably won't be UTF-8. To indicate UTF-8, set the charset in the Content-Type header or include this in the <head>:

<meta charset="utf-8"/>
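As a minimal sketch of both approaches together, the page below sends the charset in the Content-Type header and also declares it in the markup. The `render_form()` helper, the page title, and the `accept-charset` attribute are illustrative assumptions, not part of the original question; only the field name `char` comes from the question's `$_POST["char"]`.

```php
<?php
// Hypothetical page that serves the form as UTF-8 so the browser submits UTF-8.

// Declare the encoding in the HTTP header (guarded in case output has already started).
if (!headers_sent()) {
    header('Content-Type: text/html; charset=utf-8');
}

// Hypothetical helper: builds the page. The field name "char" matches $_POST["char"].
function render_form(): string
{
    return <<<'HTML'
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8"/>
<title>Character to code point</title>
</head>
<body>
<form method="post" accept-charset="utf-8">
<input type="text" name="char"/>
<input type="submit" value="Convert"/>
</form>
</body>
</html>
HTML;
}

echo render_form();
```

Either declaration should be enough on its own; sending both simply keeps the header and the markup from disagreeing.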

If you include the character in a form field and the browser thinks the encoding is one (like cp1252 Western European) that does not include the character, it will panic and instead send an HTML-character-reference-encoded version, &#20154;. This is a non-useful data mangling, as you can't tell whether the original input was 人 or &#20154;, but it's an historical browser quirk we will now never get rid of.

This is why you get 2600000023000000: the characters U+0026 (&) and U+0023 (#) are the leading &# part of that mangled version. The rest of that string is 00 rather than the subsequent characters because base_convert deals with floating-point numbers, and 0x2600000023000000000000000000000000000000000000000000000000 is far too ludicrously large a number to retain precision.
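To see this for yourself, the sketch below feeds the mangled string through the same iconv/bin2hex steps as the question. It assumes the iconv extension is available and uses the explicit `UCS-4BE` name (rather than the question's `ucs-4`) so the byte order is unambiguous; the variable names are illustrative.

```php
<?php
// What the browser sends when the page encoding cannot represent U+4EBA.
$mangled = '&#20154;';

// Convert to big-endian UCS-4 so every character becomes exactly 8 hex digits.
$hex = bin2hex(iconv('UTF-8', 'UCS-4BE', $mangled));

// One group per character: the first two are U+0026 (&) and U+0023 (#).
$groups = str_split($hex, 8);
echo implode(' ', $groups), "\n";
// 00000026 00000023 00000032 00000030 00000031 00000035 00000034 0000003b

// base_convert() goes through a float for a number this large,
// so only the leading digits can be trusted.
$lossy = base_convert($hex, 16, 16);
echo $lossy, "\n";
```

The full hex string still contains all eight characters; the information is only lost in the final base_convert step.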

If you are trying to convert UTF-8-encoded characters into numeric code points, try uniord/unichr (user-written helper functions; PHP has no built-ins by those names).
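A minimal sketch of such a helper is shown here, assuming the iconv extension is available. The function name `utf8_char_to_codepoint()` is hypothetical and is not the uniord implementation the answer refers to. On PHP 7.2 or later with the mbstring extension, the built-in `mb_ord()` does the same job.

```php
<?php
// Hypothetical helper (not a PHP built-in): code point of the first character
// in a UTF-8 string, or null if the input is empty or cannot be converted.
function utf8_char_to_codepoint(string $char): ?int
{
    // UCS-4BE stores each character as one 4-byte big-endian integer.
    $ucs4 = @iconv('UTF-8', 'UCS-4BE', $char);
    if ($ucs4 === false || strlen($ucs4) < 4) {
        return null;
    }
    // 'N' unpacks an unsigned 32-bit big-endian integer; the result is keyed from 1.
    return unpack('N', $ucs4)[1];
}

// "\u{4EBA}" is 人, written as an escape so the file's own encoding does not matter.
echo dechex(utf8_char_to_codepoint("\u{4EBA}")), "\n"; // 4eba
```

Working with an integer from `unpack()` avoids `base_convert()` entirely, so there is no float round trip to lose digits.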

bobince