
Let's say I want to send Unicode glyph U+CABC 쪼 via a web service to be saved in a database.

For example, wget is being used to connect to a web service:

shell_exec("wget 'http://doit.com/testing.php?glyph=" . f(0xCABC) . "'");

Where f is the PHP function (or functions) to convert/encode/escape the glyph U+CABC.

In testing.php, the glyph is accessed via $_REQUEST:

$glyph = $_REQUEST['glyph'];

I'd like to put it in the DB, so let's set up a query string like this:

$query = 'INSERT INTO UTF8_TABLE (UTF8_FIELD) VALUES ('.g($glyph).')';

Where g is the PHP function (or functions) to convert the glyph into a MySQL compatible representation.

I can't seem to find what I need for the functions f and g.

For f, I tried escaping and encoding via numerous functions, e.g. as URL-encoded UTF-8: %EC%AA%BC. For g, I tried various unescaping and decoding functions, e.g. html_entity_decode, utf8_decode, etc.

But no matter how I encode it, it always gets interpreted as a string of three characters ìª¼ (the bytes EC AA BC read as Latin-1), which are then saved in the DB as ìª¼ (i.e. six bytes), and not as 쪼 (i.e. three bytes).

I haven't even begun to figure out how to return the glyph via SQL SELECT and encode it as JSON, but for now I would just like a straightforward way to handle UTF-8 from origin to destination.

1 Answer

$glyph = "쪼"; //or
$glyph = "\xEC\xAA\xBC";

This is your glyph encoded in UTF-8. The former works if you save your source code in UTF-8, the latter works in any case. To transport this in a URL, URL-encode it:

$url = 'http://...?glyph=' . rawurlencode($glyph);
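As a rough sketch of what the question calls f, the code point can be turned into UTF-8 bytes and then percent-encoded. This assumes PHP 7.2+ with the mbstring extension for mb_chr(); the doit.com URL is the one from the question, and escapeshellarg() is an addition of mine for the wget call, not something the question used.

```php
<?php
// Build the UTF-8 bytes for the code point, then percent-encode them.
// Assumption: PHP 7.2+ with mbstring, which provides mb_chr().
$glyph   = mb_chr(0xCABC, 'UTF-8');   // same bytes as "\xEC\xAA\xBC"
$encoded = rawurlencode($glyph);      // "%EC%AA%BC"

// Endpoint taken from the question.
$url = 'http://doit.com/testing.php?glyph=' . $encoded;

// escapeshellarg() quotes the URL so the shell passes it to wget unchanged.
$command = 'wget ' . escapeshellarg($url);
echo $command, PHP_EOL;
```

The resulting string could be handed to shell_exec() as in the question.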

On the server, PHP will automatically decode it again, so:

$glyph = $_GET['glyph'];
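To check that the value really arrives as three bytes, something like the following could help. It uses parse_str() to stand in for the decoding PHP applies to $_GET, and then reproduces the six-byte symptom from the question by deliberately misreading the bytes as Latin-1. The mb_convert_encoding() call assumes the mbstring extension is available.

```php
<?php
// Stand-in for PHP's automatic decoding of the query string into $_GET.
parse_str('glyph=%EC%AA%BC', $params);
$glyph = $params['glyph'];

echo bin2hex($glyph), PHP_EOL;    // ecaabc
echo strlen($glyph), PHP_EOL;     // 3

// The symptom: the same bytes treated as Latin-1 text and re-encoded as UTF-8.
// Assumption: mbstring is installed.
$misread = mb_convert_encoding($glyph, 'UTF-8', 'ISO-8859-1');

echo bin2hex($misread), PHP_EOL;  // c3acc2aac2bc
echo strlen($misread), PHP_EOL;   // 6
```

If bin2hex() shows ecaabc in testing.php, the transport is fine and the extra bytes are being introduced later, most likely by the database connection.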

From there, insert it into the database the same way you would any other UTF-8 encoded text, mostly making sure the database connection encoding is set correctly. See the question "UTF-8 all the way through".
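A minimal sketch of the database step, assuming the mysqli extension (the answer does not say which driver is in use; with PDO the charset would go in the DSN instead). insert_glyph() is a hypothetical helper name, the credentials in the usage comment are placeholders, and the table and column names are the ones from the question.

```php
<?php
// Hypothetical helper: store one UTF-8 string in UTF8_TABLE.UTF8_FIELD.
function insert_glyph(mysqli $db, string $glyph): bool
{
    // Declare the connection encoding so MySQL does not treat the
    // incoming bytes as latin1. utf8mb4 is MySQL's full UTF-8 charset.
    if (!$db->set_charset('utf8mb4')) {
        return false;
    }

    // A prepared statement takes the place of g(): the bytes are sent
    // as-is, with no manual escaping or conversion.
    $stmt = $db->prepare('INSERT INTO UTF8_TABLE (UTF8_FIELD) VALUES (?)');
    if ($stmt === false) {
        return false;
    }
    $stmt->bind_param('s', $glyph);

    return $stmt->execute();
}

// Usage, with placeholder credentials:
// $db = new mysqli('localhost', 'user', 'password', 'mydb');
// insert_glyph($db, $_GET['glyph']);
```

The table and column charsets only describe how data is stored; the connection charset is a separate setting, which is why it has to be set even when every column is already UTF-8.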

deceze
  • Thank you, set_charset() was the missing key. Had changed databases and didn't realize they were using latin1. Strange that the entire DB, all text fields and collations are set to UTF-8, but this extra step is needed! –  Dec 24 '13 at 16:32