My PHP script receives JSON data from an external source; unfortunately, somewhere along the way, the UTF-8 characters in this data get corrupted.
For instance, I should be receiving the string "40.80 – Origin:", but instead I get something like "40.80 â Origin:". Inspecting the bytes around the corrupt character with hexdump and utfinfo.pl, I get:
$ echo " – O" | perl utfinfo.pl
Got 4 uchars
Char: ' ' u: 32 [0x0020] b: 32 [0x20] n: SPACE [Basic Latin]
Char: '–' u: 8211 [0x2013] b: 226,128,147 [0xE2,0x80,0x93] n: EN DASH [General Punctuation]
Char: ' ' u: 32 [0x0020] b: 32 [0x20] n: SPACE [Basic Latin]
Char: 'O' u: 79 [0x004F] b: 79 [0x4F] n: LATIN CAPITAL LETTER O [Basic Latin]
$ echo " – O" | hexdump -C
00000000 20 e2 80 93 20 4f 0a | ... O.|
$ echo " â O" | perl utfinfo.pl
Got 6 uchars
Char: ' ' u: 32 [0x0020] b: 32 [0x20] n: SPACE [Basic Latin]
Char: 'â' u: 226 [0x00E2] b: 195,162 [0xC3,0xA2] n: LATIN SMALL LETTER A WITH CIRCUMFLEX [Latin-1 Supplement]
Char: '' u: 128 [0x0080] b: 194,128 [0xC2,0x80] n: <control> [Latin-1 Supplement]
Char: '' u: 147 [0x0093] b: 194,147 [0xC2,0x93] n: <control> [Latin-1 Supplement]
Char: ' ' u: 32 [0x0020] b: 32 [0x20] n: SPACE [Basic Latin]
Char: 'O' u: 79 [0x004F] b: 79 [0x4F] n: LATIN CAPITAL LETTER O [Basic Latin]
$ echo " â O" | hexdump -C
00000000 20 c3 a2 c2 80 c2 93 20 4f 0a | ...... O.|
So, basically, the UTF-8 byte sequence for the en dash, 0xE2,0x80,0x93, somehow got changed to 0xC3,0xA2 0xC2,0x80 0xC2,0x93. (Seemingly, I could just drop the 0xC2 from the last two pairs, but I can't see how to transform 0xC3,0xA2 back into 0xE2 for the first byte.)
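Looking at the utfinfo.pl output again, each corrupt character has a code point equal to one of the original bytes (U+00E2, U+0080, U+0093 versus 0xE2, 0x80, 0x93). One possible explanation, which is only an assumption on my part, is that the original UTF-8 bytes were read as ISO-8859-1 somewhere and then encoded to UTF-8 a second time. A minimal sketch to test that hypothesis on the bytes from the hexdump, assuming the mbstring extension is available (`$corrupt` and `$fixed` are just illustrative names):

```php
<?php
// Bytes taken from the second hexdump above (without the trailing newline):
// 20 c3 a2 c2 80 c2 93 20 4f
$corrupt = "\x20\xC3\xA2\xC2\x80\xC2\x93\x20\x4F";

// Assumption: the data was UTF-8, misread as ISO-8859-1, then re-encoded as UTF-8.
// Converting UTF-8 -> ISO-8859-1 turns each code point back into a single byte:
// U+00E2 -> 0xE2, U+0080 -> 0x80, U+0093 -> 0x93.
$fixed = mb_convert_encoding($corrupt, 'ISO-8859-1', 'UTF-8');

echo bin2hex($corrupt), "\n"; // 20c3a2c280c293204f
echo bin2hex($fixed), "\n";   // 20e28093204f

// The result should itself be well-formed UTF-8 if the hypothesis holds.
var_dump(mb_check_encoding($fixed, 'UTF-8'));
```

If the hypothesis is right, this would also answer how 0xC3,0xA2 maps back to 0xE2: those two bytes are simply the UTF-8 encoding of U+00E2. I have only checked this against the one sample above, so it may not hold for data that was mangled differently (for example through Windows-1252 instead of ISO-8859-1).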
Anyway, I thought I could use some of PHP's built-in functions to convert this back to proper UTF-8, so I wrote this small test script, test_utf8.php:
<?php
# 40.80 – Origin:
$tstr = "40.80 â Origin:";
echo "$tstr\n";
print(mb_detect_encoding ($tstr) . "\n"); // UTF-8 here
$tstrB = mb_convert_encoding($tstr, "UTF-8");
echo "$tstrB\n";
$tstrC = iconv('ASCII', 'UTF-8//IGNORE', $tstr);
echo "$tstrC\n";
$tstrD = utf8_encode($tstr);
echo "$tstrD\n";
?>
... unfortunately, it doesn't work - this is the output I get in terminal when running it via php CLI:
$ php test_utf8.php
40.80 â Origin:
UTF-8
40.80 â Origin:
PHP Notice: iconv(): Detected an illegal character in input string in /path/to/test_utf8.php on line 10
40.80 â Origin:
... that is, I corrupt everything even more. (Note that mb_detect_encoding detects this string as UTF-8, for some reason.)
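As far as I can tell, the UTF-8 detection may not be wrong as such: the corrupt bytes still form a well-formed UTF-8 sequence, they just encode the wrong characters. A small sketch to check this on the bytes from the hexdump, again assuming mbstring is available (`$corrupt` is an illustrative name):

```php
<?php
// Corrupt bytes from the hexdump: 20 c3 a2 c2 80 c2 93 20 4f
$corrupt = "\x20\xC3\xA2\xC2\x80\xC2\x93\x20\x4F";

// True if the byte sequence is structurally valid UTF-8,
// regardless of whether the decoded characters make sense.
var_dump(mb_check_encoding($corrupt, 'UTF-8'));

// Byte count versus character count when decoded as UTF-8.
echo strlen($corrupt), "\n";             // 9 bytes
echo mb_strlen($corrupt, 'UTF-8'), "\n"; // 6 characters, matching "Got 6 uchars"
```

So a validity check alone cannot tell this kind of corruption apart from correct input.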
So, how can I re-convert this string back to correct UTF-8?
EDIT: (un)fortunately, SO got rid of the bad characters, so you won't be able to reconstruct this example just by copy-pasting :( , but hopefully the hexdumps provide enough info?! If not, I reposted the above to a GitHub Gist, which in its raw view seems to preserve the characters...