My PHP script receives JSON data from an external source; unfortunately, somewhere along the way, the UTF-8 characters in this data get corrupted.

For instance, I should be receiving the string "40.80 – Origin:", but instead I get something like "40.80 â Origin:". Inspecting the bytes around the corrupted character with hexdump and utfinfo.pl, I get:

$ echo " – O" | perl utfinfo.pl 
Got 4 uchars
Char: ' ' u: 32 [0x0020] b: 32 [0x20] n: SPACE [Basic Latin]
Char: '–' u: 8211 [0x2013] b: 226,128,147 [0xE2,0x80,0x93] n: EN DASH [General Punctuation]
Char: ' ' u: 32 [0x0020] b: 32 [0x20] n: SPACE [Basic Latin]
Char: 'O' u: 79 [0x004F] b: 79 [0x4F] n: LATIN CAPITAL LETTER O [Basic Latin]

$ echo " – O" | hexdump -C
00000000  20 e2 80 93 20 4f 0a                              | ... O.|

$ echo " â O" | perl utfinfo.pl 
Got 6 uchars
Char: ' ' u: 32 [0x0020] b: 32 [0x20] n: SPACE [Basic Latin]
Char: 'â' u: 226 [0x00E2] b: 195,162 [0xC3,0xA2] n: LATIN SMALL LETTER A WITH CIRCUMFLEX [Latin-1 Supplement]
Char: '' u: 128 [0x0080] b: 194,128 [0xC2,0x80] n: <control> [Latin-1 Supplement]
Char: '' u: 147 [0x0093] b: 194,147 [0xC2,0x93] n: <control> [Latin-1 Supplement]
Char: ' ' u: 32 [0x0020] b: 32 [0x20] n: SPACE [Basic Latin]
Char: 'O' u: 79 [0x004F] b: 79 [0x4F] n: LATIN CAPITAL LETTER O [Basic Latin]

$ echo " â O" | hexdump -C
00000000  20 c3 a2 c2 80 c2 93 20  4f 0a                    | ...... O.|

So, basically, the UTF-8 byte sequence for the en dash, 0xE2,0x80,0x93, somehow got changed to 0xC3,0xA2 0xC2,0x80 0xC2,0x93. (Seemingly, I could just drop the 0xC2 from the last two pairs, but I can't see how I could transform 0xC3,0xA2 back into 0xE2 for the first byte.)
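The byte pattern above is consistent with double encoding: each byte of the original sequence read as an ISO-8859-1 character and then encoded as UTF-8 again. The following is only a sketch that reproduces the bytes under that assumption; it does not show where in the pipeline this actually happens, and it assumes the mbstring extension is available.

```php
<?php
// UTF-8 bytes for EN DASH (U+2013), as shown in the first hexdump.
$original = "\xE2\x80\x93";

// Assumption: something upstream treated each byte as an ISO-8859-1
// character and re-encoded it as UTF-8.
$corrupted = mb_convert_encoding($original, 'UTF-8', 'ISO-8859-1');

echo bin2hex($original), "\n";   // e28093
echo bin2hex($corrupted), "\n";  // c3a2c280c293
```

The second line of output matches the bytes in the second hexdump (c3 a2 c2 80 c2 93), which suggests the corruption is one extra layer of UTF-8 encoding rather than random damage.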

Anyway, I thought I could use some of PHP's built-in functions to convert this back to UTF-8, so I wrote this small test script, test_utf8.php:

<?php
# 40.80  – Origin:
$tstr = "40.80  â Origin:";
echo "$tstr\n";
print(mb_detect_encoding ($tstr) . "\n"); // UTF-8 here

$tstrB = mb_convert_encoding($tstr, "UTF-8");
echo "$tstrB\n";

$tstrC = iconv('ASCII', 'UTF-8//IGNORE', $tstr);
echo "$tstrC\n";

$tstrD = utf8_encode($tstr);
echo "$tstrD\n";

?>

... unfortunately, it doesn't work; this is the output I get in the terminal when running it via the PHP CLI:

$ php test_utf8.php
40.80  â Origin:
UTF-8
40.80  â Origin:
PHP Notice:  iconv(): Detected an illegal character in input string in /path/to/test_utf8.php on line 10

40.80  â Origin:

... that is, I corrupt everything even more. (Note that mb_detect_encoding detects this string as UTF-8, for some reason.)

So, how can I re-convert this string back to correct UTF-8?

EDIT: (un)fortunately, SO stripped the bad characters, so you won't be able to reconstruct this example just by copy-pasting :( Hopefully the hexdumps provide enough info. If not, I have reposted the above to a Github Gist, whose raw view seems to preserve the characters...
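Since pasted characters may not survive, the corrupted sample can also be rebuilt directly from the hexdump bytes with PHP's "\x.." escapes, which avoids the copy-paste chain entirely. This is a minimal sketch; the single space after "40.80" follows the string quoted at the top of the question, and mbstring is assumed to be available.

```php
<?php
// Corrupted sample rebuilt from the hexdump bytes.
// Note: "\x.." escapes are only interpreted in double-quoted strings.
$tstr = "40.80 \xC3\xA2\xC2\x80\xC2\x93 Origin:";

// Dump the raw bytes so the test does not depend on the terminal encoding.
echo bin2hex($tstr), "\n";

// The corrupted bytes are still well-formed UTF-8 (U+00E2, U+0080, U+0093),
// which is why an encoding check accepts the string.
var_dump(mb_check_encoding($tstr, 'UTF-8')); // bool(true)
```

This also explains why mb_detect_encoding reports UTF-8: the string is valid UTF-8, it just encodes the wrong characters.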

  • How do you get the JSON data, what is your accepted encoding? – postrel Jun 30 '16 at 12:47
  • @postrel - the accepted encoding is UTF-8, I obtain it via casperjs from a webpage (not public) which otherwise declares it as UTF-8, so I really don't understand why the corruption occurs at all; unfortunately I cannot reconstruct an example to demonstrate this in casperjs, and so am forced to correct this in PHP, if at all possible... – sdbbs Jun 30 '16 at 12:50
  • `echo " â O"` on the CLI is a bad way to test the encoding, there's such a complex chain of decoding and encoding happening in this copy-from-source-paste-to-terminal-interpret-by-CLI going on that it's impossible to say whether the result means anything. Same for your PHP test script. You'll have to directly hex dump the source. – deceze Jun 30 '16 at 12:50
  • @deceze - I have just double checked the output of my casperjs script which dumps to stdout (and which I read via shell_exec), I can confirm the hexdump of the bytes is the same as I've posted them in the OP; I think also in the gist link the actual bytes are preserved as well... – sdbbs Jun 30 '16 at 12:53
  • @sdbbs For the purpose of testing, could you get your data via `file_get_contents()` for example? – postrel Jun 30 '16 at 12:55
  • @postrel - I'm not sure how exactly would I do that in this case, but try the [Github Gist](https://gist.githubusercontent.com/anonymous/76b7d3052dbbd9b9e1c4ac7546d8b620/raw/4d55d5e84b4bac6ca6f18cff8310b0d675e456d3/test.txt) link, I've reposted the PHP script there, and it seems the actual bytes are preserved. – sdbbs Jun 30 '16 at 13:00

1 Answer

I think I got it, thanks to "Convert utf8-characters to iso-88591 and back in PHP":

utf8_decode — Converts a string with ISO-8859-1 characters encoded with UTF-8 to single-byte ISO-8859-1

So, I tried adding to the script:

$tstrF = utf8_decode($tstr);
echo "$tstrF\n";

... and this prints out 40.80 – Origin: as it should.
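This works because the data appears to be double-encoded: decoding UTF-8 to ISO-8859-1 removes one layer, and the resulting bytes are the original UTF-8 sequence. Note that utf8_decode is deprecated as of PHP 8.2, so the sketch below uses mb_convert_encoding instead (assuming mbstring is available). The helper name undoDoubleUtf8 and its two guards are hypothetical additions for illustration, not part of the original fix, and they are a heuristic only.

```php
<?php
/**
 * Hypothetical helper: undo one layer of UTF-8 double encoding.
 * Returns the input unchanged if it does not look double-encoded.
 */
function undoDoubleUtf8(string $input): string
{
    // Decode UTF-8 into single ISO-8859-1 bytes (U+00E2 becomes 0xE2, etc.).
    $candidate = mb_convert_encoding($input, 'ISO-8859-1', 'UTF-8');

    // Guard 1: the recovered bytes must themselves be valid UTF-8.
    if (!mb_check_encoding($candidate, 'UTF-8')) {
        return $input;
    }

    // Guard 2: re-encoding must reproduce the input exactly; this fails when
    // characters outside ISO-8859-1 were replaced during the decode step.
    if (mb_convert_encoding($candidate, 'UTF-8', 'ISO-8859-1') !== $input) {
        return $input;
    }

    return $candidate;
}

// Double-encoded sample, written as hex escapes taken from the hexdump.
$tstr = "40.80 \xC3\xA2\xC2\x80\xC2\x93 Origin:";
$fixed = undoDoubleUtf8($tstr);

echo $fixed, "\n";          // 40.80 – Origin: (on a UTF-8 terminal)
echo bin2hex($fixed), "\n"; // contains e28093 for the en dash
```

Correctly encoded text that happens to consist only of Latin-1 characters forming valid UTF-8 sequences would still be altered, so it is safer to apply this only to data known to be affected.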
