How to convert malformed database characters (ascii to utf-8)

Question

I know many people will say this has already been answered, for example at https://stackoverflow.com/a/4983999/1833322, but let me explain why it's not quite that straightforward.

I would like to use PHP to convert something "that looks like ascii" into "utf-8"

There is a website which does this https://onlineutf8tools.com/convert-ascii-to-utf8

When I input the string Zâ€¦Z, I get back Z⬦Z, which is the correct output.

I tried iconv and some mb_ functions, but I can't figure out whether these functions can do what I want or which options I need. If it's not possible with these functions, some self-written PHP code would be appreciated. (The website runs JavaScript, and I don't think PHP is less capable in this regard.)

To be clear: the goal is to recreate in PHP what that website is doing, not to have a semantic debate about ASCII and UTF-8.

EDIT: the website uses https://github.com/mathiasbynens/utf8.js, which says:

it can encode/decode any scalar Unicode code point values, as per the Encoding Standard.

The "Encoding Standard" mentioned there links to https://encoding.spec.whatwg.org/#utf-8. So this library says it implements the standard; what about PHP?

Álvaro González · Accepted Answer · 2020-05-02T11:59:24.210

UTF-8 is a superset of ASCII so converting from ASCII to UTF-8 is like converting a car into a vehicle.

+--- UTF-8 ---------------+
|                         |
|   +--- ASCII ---+       |
|   |             |       |
|   +-------------+       |
+-------------------------+

The tool you link seems to be using the term "ASCII" as a synonym for mojibake (it says "car" but means "scrap metal"). Mojibake typically happens this way:

1. You pick a non-English character: ⬦ 'WHITE MEDIUM DIAMOND' (U+2B26)
2. You encode it using UTF-8: 0xE2 0xAC 0xA6
3. You open the stream in a tool that's configured to use the single-byte encoding that's widely used in your area: Windows-1252
4. You look up the individual bytes of the UTF-8 character in the character table of the single-byte encoding:
   - 0xE2 -> â
   - 0xAC -> ¬
   - 0xA6 -> ¦
5. You encode the resulting characters in UTF-8:
   - â = 'LATIN SMALL LETTER A WITH CIRCUMFLEX' (U+00E2) = 0xC3 0xA2
   - ¬ = 'NOT SIGN' (U+00AC) = 0xC2 0xAC
   - ¦ = 'BROKEN BAR' (U+00A6) = 0xC2 0xA6

Thus you've transformed the UTF-8 stream 0xE2 0xAC 0xA6 (⬦) into the also UTF-8 stream 0xC3 0xA2 0xC2 0xAC 0xC2 0xA6 (â¬¦).
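The forward direction can be reproduced in PHP as a quick check. This is a minimal sketch, assuming the mbstring extension is loaded and that Windows-1252 is the single-byte encoding involved; the variable names are only illustrative.

```php
<?php
// Bytes of U+2B26 (WHITE MEDIUM DIAMOND) encoded as UTF-8
$original = "\xE2\xAC\xA6";

// Misread those bytes as Windows-1252 and re-encode the result as UTF-8.
// mb_convert_encoding() takes the target encoding first, then the source.
$mojibake = mb_convert_encoding($original, 'UTF-8', 'Windows-1252');

var_dump(bin2hex($original)); // string(6) "e2aca6"
var_dump(bin2hex($mojibake)); // string(12) "c3a2c2acc2a6"
```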

To undo this you need to reverse the steps. That's straightforward if you know what proxy encoding was used (Windows-1252 in my example):

$mojibake = "\xC3\xA2\xC2\xAC\xC2\xA6";
$proxy = 'Windows-1252';
var_dump($mojibake, bin2hex($mojibake));
$original = mb_convert_encoding($mojibake, $proxy, 'UTF-8');
var_dump($original, bin2hex($original));

string(6) "â¬¦"
string(12) "c3a2c2acc2a6"
string(3) "⬦"
string(6) "e2aca6"
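If the result will be used without anyone looking at it, it may help to confirm that the reversal actually produced valid UTF-8. A minimal sketch, assuming mbstring is loaded; `undo_mojibake` is a hypothetical helper name, not a PHP built-in:

```php
<?php
/**
 * Hypothetical helper: reverse one round of mojibake through $proxy.
 * Returns null when the recovered bytes are not valid UTF-8, which
 * suggests that $proxy is the wrong guess.
 */
function undo_mojibake(string $mojibake, string $proxy): ?string
{
    $candidate = mb_convert_encoding($mojibake, $proxy, 'UTF-8');
    if ($candidate === false || !mb_check_encoding($candidate, 'UTF-8')) {
        return null;
    }
    return $candidate;
}

$mojibake = "\xC3\xA2\xC2\xAC\xC2\xA6"; // â¬¦
var_dump(bin2hex((string) undo_mojibake($mojibake, 'Windows-1252'))); // string(6) "e2aca6"
```

The validity check is only a sanity filter: plain ASCII text passes under any proxy, so a non-null result does not prove the guess was right.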

But it's tricky if you don't. I guess you can:

1. Compile a dictionary of the byte sequences you get in the different single-byte encodings and then use some kind of Bayesian inference to figure out the most likely encoding. (I can't really help you with this.)

2. Try the most likely encodings and visually inspect the output to determine which is correct:

// Source code saved as UTF-8
$mojibake = "Zâ€¦Z";
foreach (mb_list_encodings() as $proxy) {
    $original = mb_convert_encoding($mojibake, $proxy, 'UTF-8');
    echo $proxy, ': ', $original, PHP_EOL;
}

3. If (as in your case) you know what the original text is and you're fairly sure that you don't have mixed encodings, do as in #2 but try all the encodings PHP supports and compare each result with the expected text:

// Source code saved as UTF-8
$mojibake = 'Zâ€¦Z';
$expected = 'Z⬦Z';
foreach (mb_list_encodings() as $proxy) {
    $current = @mb_convert_encoding($mojibake, $proxy, 'UTF-8');
    if ($current === $expected) {
        echo "$proxy: match\n";
    }
}

(This prints wchar: match; not really sure what that means.)
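One possible reading, offered as an assumption rather than a confirmed explanation: the expected Z⬦Z falls out if each character of the mojibake is reduced to the low byte of its code point, so that € (U+20AC) becomes 0xAC. The linked tool may be doing something along these lines, but this has not been verified against its source. A minimal sketch; `low_bytes` is a hypothetical helper name, and it needs PHP 7.4 or later for mb_str_split:

```php
<?php
/**
 * Hypothetical helper: keep only the low byte of every code point
 * and return the resulting byte string.
 */
function low_bytes(string $utf8): string
{
    $bytes = '';
    foreach (mb_str_split($utf8, 1, 'UTF-8') as $char) {
        $bytes .= chr(mb_ord($char, 'UTF-8') & 0xFF);
    }
    return $bytes;
}

$mojibake = "Z\u{E2}\u{20AC}\u{A6}Z"; // Zâ€¦Z
$result = low_bytes($mojibake);

var_dump(bin2hex($result));         // string(10) "5ae2aca65a"
var_dump($result === "Z\u{2B26}Z"); // bool(true)
```

This is lossy, because different characters can collapse to the same byte, so it is only worth trying when the expected output is known. For comparison, reversing the same string through Windows-1252 yields the bytes e2 80 a6, which is … 'HORIZONTAL ELLIPSIS' (U+2026).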

Is there a PHP function to go from `â€¦` to `\xC3\xA2\xC2\xAC\xC2\xA6`? This seems to be a missing step at the very start. Otherwise great answer! I will put a small bounty as reward. — Flip, May 02 '20 at 13:21
If your editor is configured to save files as UTF-8 there's simply no difference between both strings. If you need the bytes you have [bin2hex](https://php.net/bin2hex) or just use [Unicode Inspector](https://apps.timwhitlock.info/unicode/inspect?s=%C3%A2%E2%82%AC%C2%A6). — Álvaro González, May 02 '20 at 14:19
I tried this with two hex editors (bless and wxHexEditor) and inspected a file with contents `â€¦`. Both editors show `C3 A2 E2 82 AC C2 A6`. — Flip, May 02 '20 at 15:16
That's correct for `â€¦` (coming from `wchar`). My example was `â¬¦` (coming from `Windows-1252`). — Álvaro González, May 02 '20 at 16:05
