Decode unicode charmap (most likely non-standard) with PHP

Question

I have this string:

\u00c3\u0083\u00c2\u00b6

That stands for the German ö character (`&ouml;` in HTML).

My issue is that I don't know what encoding it is in. I tried several decoding methods (including `json_decode` and `mb_convert_encoding('\u00c3\u0083\u00c2\u00b6', 'HTML-ENTITIES', 'UTF-8');`) to get to the ö character, but none of them worked.

I cannot look up how this was encoded in the first place, because it comes from a database dump for which the source code is unavailable.

This question is NOT a duplicate of "How to decode Unicode escape sequences like "\u00ed" to proper UTF-8 encoded characters?", because the charmap does not appear to be valid UTF-8 or UTF-16 and therefore cannot be decoded with any of the methods in the linked question.

That's some serious mojibake going on there. Something like UTF-8 interpreted as Latin-1 encoded to Unicode escapes, or something along those lines. Definitely something you should be fixing at the source, if it's not too late for that. — deceze, Dec 26 '17 at 20:35
I actually just need that dump, I do not need to import it again or something else that would require me to fix the code (which I don't have for the same reason). Is there any way for me to decode this mess somehow? Ideally with PHP. Thanks! — TheNiceGuy, Dec 26 '17 at 20:41
First, try a few encoding settings on the table that data is stored in. `ALTER TABLE [table] CONVERT TO CHARACTER SET [utf8_general_ci, ucs2_general_ci, etc.];` See: https://dev.mysql.com/doc/refman/5.5/en/charset-charsets.html. If your table character-encoding doesn't match the encoding when the data was stored, you'll get all kinds of problems like this. One way or another, you're going to need to identify the original encoding. — Tony Chiboucas, Dec 26 '17 at 20:42
Your data is hosed. Double-hosed, in fact. It looks like mojibake of mojibake. You need to first fix it at the source, and then use [UTF8 all the way through](https://stackoverflow.com/questions/279170/utf-8-all-the-way-through). — Sammitch, Dec 26 '17 at 21:02

score 2 · Accepted Answer · answered Dec 26 '17 at 21:21

So for reference, your source data was UTF8, and then someone ran something equivalent to utf8_encode() [which translates ISO8859-1 to UTF8, without regard to what the input actually is] on it twice.
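To make that claim concrete, here is a minimal sketch that reproduces the damage in the forward direction. It assumes the mbstring extension is available and uses `mb_convert_encoding()` as a stand-in for `utf8_encode()`; the helper name `add_latin1_layer` is made up for illustration.

```php
<?php
// Hypothetical helper: treat every input byte as an ISO-8859-1 character
// and re-encode it as UTF-8, which is the conversion utf8_encode() performs.
function add_latin1_layer(string $bytes): string
{
    return mb_convert_encoding($bytes, 'UTF-8', 'ISO-8859-1');
}

$original = "\xc3\xb6";                  // "ö" as correct UTF-8
$once     = add_latin1_layer($original); // first accidental re-encoding
$twice    = add_latin1_layer($once);     // second accidental re-encoding

echo bin2hex($original), "\n"; // c3b6
echo bin2hex($once), "\n";     // c383c2b6
echo bin2hex($twice), "\n";    // c383c283c382c2b6
```

Read as UTF-8, the last byte sequence is the four code points U+00C3, U+0083, U+00C2, U+00B6, which are exactly the escapes in the question.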

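```php
// Convert literal \uXXXX escapes (BMP code points only) to UTF-8.
function unescape_unicode($input) {
    return preg_replace_callback(
        '/\\\\u([0-9a-fA-F]{4})/',
        function ($match) {
            // The four hex digits form one UTF-16BE code unit.
            return mb_convert_encoding(
                pack('H*', $match[1]),
                'UTF-8',
                'UTF-16BE'
            );
        },
        $input
    );
}

// Single quotes keep the backslashes literal.
$input = '\u00c3\u0083\u00c2\u00b6';

// Note: utf8_decode() is deprecated as of PHP 8.2.
var_dump(
    bin2hex(
        utf8_decode( // un-mojibake #1
            utf8_decode( // un-mojibake #2
                unescape_unicode($input)
            )
        )
    )
);
```

Output:

```
string(4) "c3b6"
```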

Where 0xc3 0xb6 is the UTF8 representation of ö.
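Since `utf8_decode()` is deprecated as of PHP 8.2, the same repair can be sketched with mbstring only. This is an illustrative variant, not a drop-in replacement: the function names `unescape_bmp` and `undo_latin1_layer` are hypothetical, it assumes PHP 7.2+ (for `mb_chr()`), and it only handles four-digit escapes in the Basic Multilingual Plane.

```php
<?php
// Hypothetical helper: turn literal \uXXXX escapes into UTF-8 characters.
// Escapes that mb_chr() rejects (e.g. surrogates) are left untouched.
function unescape_bmp(string $input): string
{
    return preg_replace_callback(
        '/\\\\u([0-9a-fA-F]{4})/',
        function (array $m): string {
            $char = mb_chr(hexdec($m[1]), 'UTF-8');
            return $char === false ? $m[0] : $char;
        },
        $input
    );
}

// Hypothetical helper: undo one "Latin-1 bytes re-encoded as UTF-8" layer
// by converting the UTF-8 text back to ISO-8859-1 bytes.
function undo_latin1_layer(string $s): string
{
    return mb_convert_encoding($s, 'ISO-8859-1', 'UTF-8');
}

$input = '\u00c3\u0083\u00c2\u00b6'; // single quotes: backslashes stay literal
$fixed = undo_latin1_layer(undo_latin1_layer(unescape_bmp($input)));

echo bin2hex($fixed), "\n"; // c3b6
```

The number of layers to undo (two here) is specific to this dump; applying it to data that was damaged a different number of times would corrupt it further.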

Do NOT put this code into production. You should only use it to un-hose data that cannot be otherwise recovered or retrieved properly from underlying storage. The primary intent of the above code is to illustrate how it is broken.

This is your new bible: [UTF-8 all the way through](https://stackoverflow.com/questions/279170/utf-8-all-the-way-through)
