EDIT2: The issue was with how my Perl client was interpreting the output from PHP's json_encode which outputs Unicode code points by default. Putting the JSON Perl module in ascii mode (my $j = JSON->new()->ascii();) made things work as expected.


I'm interacting with an API written in PHP that returns JSON, using a client written in Perl which then submits a modified version of the JSON back to the same API. The API pulls values from a PostgreSQL database whose encoding is UTF8. What I'm running into is that the API appears to return a different character encoding, even though the value PHP receives from the database is proper UTF-8.

I've managed to reproduce what I'm seeing with a couple lines of PHP (5.3.24):

<?php
$val = array("Millán");
print json_encode($val)."\n";

According to the PHP documentation, string literals are "encoded ... in whatever fashion [they are] encoded in the script file."

Here is the hex dumped file encoding (UTF-8 lower case a-acute = c3 a1):

$ grep ill test.php | od -An -t x1c
  24  76  61  6c  20  3d  20  61  72  72  61  79  28  22  4d  69
   $   v   a   l       =       a   r   r   a   y   (   "   M   i
  6c  6c  c3  a1  6e  22  29  3b  0a
   l   l 303 241   n   "   )   ;  \n

And here is the output from PHP:

$ php -f test.php | od -An -t x1c
  5b  22  4d  69  6c  6c  5c  75  30  30  65  31  6e  22  5d  0a
   [   "   M   i   l   l   \   u   0   0   e   1   n   "   ]  \n

The UTF-8 lower case a-acute has been changed to a "Unicode" lower case a-acute by json_encode.

How can I keep PHP/json_encode from switching the encoding of this variable?

EDIT: What's interesting is that if I change the string literal to utf8_encode("Millán") then things work as expected. The utf8_encode docs say that function only supports ISO-8859-1 input, so I'm a bit confused about why that works.

Dave Gray
  • A `\uXXXX` escape sequence is a perfectly valid way to encode arbitrary characters in the JSON data format. When decoding it from JSON you'll get your character back. Am I missing anything beyond that? – deceze Mar 19 '14 at 17:55
  • Well, \u00e1 is not the same as \u00c3\u00a1 - one of those is a valid utf8 character, and the other is not. PHP is able to translate \u00e1 back into utf8 somehow. The issue I'm running into is that when the encoded string with \u00e1 leaves PHP world, and gets interpreted by e.g. Perl, then passed back into PHP, `json_decode` throws `JSON_ERROR_UTF8` – Dave Gray Mar 19 '14 at 18:09
  • `\u00e1` stands for U+00E1, the character "á". `\u00c3\u00a1` is certainly *not* "á". – deceze Mar 19 '14 at 18:12
  • U+C3A1 or \uc3a1 is the character 쎡. Don't confuse *Unicode code points* (U+... and \u...) with the physical UTF-* encoding! Maybe read [What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text](http://kunststube.net/encoding/) – deceze Mar 19 '14 at 18:15
  • Thanks for the link, that pushed me in the right direction. A small change to my Perl client fixed things: `my $j = JSON->new()->ascii();` If you submit an answer, I'll accept it. – Dave Gray Mar 19 '14 at 18:55

2 Answers

This is entirely based on a misunderstanding. json_encode encodes non-ASCII characters as Unicode escape sequences of the form \uXXXX. These sequences do not refer to the physical bytes of any UTF encoding; they refer to the character by its Unicode code point. U+00E1 is the Unicode code point for the character á. Any proper JSON parser will decode \u00e1 back into the character "á". There's no issue here.
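
To make the distinction concrete, here is a minimal sketch in Python rather than PHP or Perl, chosen only because its standard json module escapes non-ASCII characters by default in much the same way. The variable names are illustrative and nothing here is taken from the asker's actual client. It shows that the escape round-trips to the same character, that the code point and its UTF-8 bytes are different things, and where a sequence like \u00c3\u00a1 comes from when UTF-8 bytes are treated as ISO-8859-1 characters.

```python
import json

# "Millán", written with an escape so the encoding of this source file is irrelevant.
name = "Mill\u00e1n"

# By default json.dumps escapes non-ASCII characters as \uXXXX code points.
encoded = json.dumps([name])
print(encoded)

# Any conforming JSON parser turns the escape back into the same character.
decoded = json.loads(encoded)
print(decoded == [name])

# The code point (U+00E1) and its UTF-8 byte sequence (c3 a1) are not the same thing.
print(hex(ord("\u00e1")))
print("\u00e1".encode("utf-8").hex())

# Treating the two UTF-8 bytes as ISO-8859-1 characters yields two code points,
# U+00C3 and U+00A1. Serializing those is what produces \u00c3\u00a1.
mangled = "\u00e1".encode("utf-8").decode("latin-1")
print(json.dumps(mangled))
```

The last step is the likely reason the utf8_encode variant in the question looked different: that function assumes ISO-8859-1 input, so feeding it bytes that are already UTF-8 encodes them a second time.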

deceze

Try the code below to solve the problem.

<?php
$val = array("Millán");
print json_encode($val, JSON_UNESCAPED_UNICODE);

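Note: passing the JSON_UNESCAPED_UNICODE option to json_encode keeps non-ASCII characters as literal UTF-8 instead of \uXXXX escapes. The option was added in PHP 5.4, so it is not available on the PHP 5.3.24 mentioned in the question.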

For Python, see the related question "Saving utf-8 texts in json.dumps as UTF8, not as \u escape sequence".
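
For the Python case, a rough sketch of the equivalent switch, assuming only the standard json module: ensure_ascii=False plays a role similar to JSON_UNESCAPED_UNICODE, though it is a separate parameter in a separate library, not the same option. The variable names are made up for the example.

```python
import json

# "Millán", written with an escape so the encoding of this source file is irrelevant.
name = "Mill\u00e1n"

# Default behaviour: non-ASCII characters become \uXXXX escapes.
escaped = json.dumps([name])

# ensure_ascii=False leaves the character in the output string unescaped.
unescaped = json.dumps([name], ensure_ascii=False)

# Both documents describe exactly the same data once parsed.
same_data = json.loads(escaped) == json.loads(unescaped)
print(same_data)

# The unescaped form only becomes bytes when it is explicitly encoded,
# here as UTF-8, which puts c3 a1 on the wire for the a-acute.
payload = unescaped.encode("utf-8")
print(payload.hex())
```

Either form is valid JSON, so the choice mostly matters when the consumer on the other end mishandles one of them, as the Perl client in the question did.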

Community