UTF-8 is a superset of ASCII, so converting from ASCII to UTF-8 is like converting a car into a vehicle: every ASCII string is already valid UTF-8.
+--- UTF-8 ---------------+
|                         |
|    +--- ASCII ---+      |
|    |             |      |
|    +-------------+      |
+-------------------------+
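The subset relationship is easy to check. Here is a minimal sketch using PHP's mbstring extension (the sample string is only an illustration): a pure-ASCII string is already valid UTF-8, and "converting" it leaves the bytes untouched.

```php
<?php
// Illustrative sample: any string made only of bytes 0x00-0x7F would do.
$ascii = 'car';

// The same bytes are valid in both encodings.
$validAscii = mb_check_encoding($ascii, 'ASCII');
$validUtf8 = mb_check_encoding($ascii, 'UTF-8');

// Converting ASCII to UTF-8 is a no-op at the byte level.
$converted = mb_convert_encoding($ascii, 'UTF-8', 'ASCII');

var_dump($validAscii, $validUtf8, $converted === $ascii);
```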
The tool you link seems to be using the term "ASCII" as a synonym for mojibake (it says "car" but means "scrap metal"). Mojibake typically happens this way:
1. You pick a non-ASCII character: ⬦ 'WHITE MEDIUM DIAMOND' (U+2B26)
2. You encode it using UTF-8: 0xE2 0xAC 0xA6
3. You open the stream in a tool that's configured to use the single-byte encoding that's widely used in your area: Windows-1252
4. You look up the individual bytes of the UTF-8 character in the character table of the single-byte encoding:
   0xE2 -> â
   0xAC -> ¬
   0xA6 -> ¦
5. You encode the resulting characters in UTF-8:
   â -> 0xC3 0xA2
   ¬ -> 0xC2 0xAC
   ¦ -> 0xC2 0xA6
Thus you've transformed the UTF-8 stream 0xE2 0xAC 0xA6 (⬦) into the also UTF-8 stream 0xC3 0xA2 0xC2 0xAC 0xC2 0xA6 (â¬¦).
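The forward transformation can be reproduced in a few lines. This is a sketch that assumes the mbstring extension is available and that the misconfigured tool behaves like a plain Windows-1252 to UTF-8 conversion:

```php
<?php
// The original character, already encoded as UTF-8.
$original = "\xE2\xAC\xA6"; // U+2B26 WHITE MEDIUM DIAMOND

// A tool wrongly assumes the bytes are Windows-1252 and re-encodes
// each "character" it sees as UTF-8.
$mojibake = mb_convert_encoding($original, 'UTF-8', 'Windows-1252');

var_dump(bin2hex($original), bin2hex($mojibake));
```

The second dump should show the six bytes c3 a2 c2 ac c2 a6, that is, the three characters â, ¬ and ¦ stored as UTF-8.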
To undo this you need to reverse the steps. That's straightforward if you know what proxy encoding was used (Windows-1252 in my example):
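// Mojibake bytes: the characters â ¬ ¦ encoded as UTF-8
$mojibake = "\xC3\xA2\xC2\xAC\xC2\xA6";
$proxy = 'Windows-1252';
var_dump($mojibake, bin2hex($mojibake));
// Converting *to* the proxy encoding maps each character back to the single byte it was misread from
$original = mb_convert_encoding($mojibake, $proxy, 'UTF-8');
var_dump($original, bin2hex($original));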
string(6) "â¬¦"
string(12) "c3a2c2acc2a6"
string(3) "⬦"
string(6) "e2aca6"
But it's tricky if you don't. I guess you can:
1. Compile a dictionary of the byte sequences you get with the different single-byte encodings and then use some kind of Bayesian inference to figure out the most likely encoding. (I can't really help you with this.)
2. Try the most likely encodings and visually inspect the output to determine which is correct:
// Source code saved as UTF-8
$mojibake = "Z…Z";
foreach (mb_list_encodings() as $proxy) {
    $original = mb_convert_encoding($mojibake, $proxy, 'UTF-8');
    echo $proxy, ': ', $original, PHP_EOL;
}
3. If (as in your case) you know what the original text is and you're fairly sure that you don't have mixed encodings, do as in #2, but compare each result against the expected text automatically, trying all the encodings PHP supports:
// Source code saved as UTF-8
$mojibake = 'Z…Z';
$expected = 'Z⬦Z';
foreach (mb_list_encodings() as $proxy) {
    $current = @mb_convert_encoding($mojibake, $proxy, 'UTF-8');
    if ($current === $expected) {
        echo "$proxy: match\n";
    }
}
(This prints wchar: match; not really sure what that means.)
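When inspecting candidates by eye, one way to shorten the list is to discard proxy encodings whose reversed output is not valid UTF-8. This is a simple heuristic sketch, not the Bayesian approach mentioned above. It assumes the original text was valid UTF-8, and the candidate list and the `$survivors` name are illustrative choices rather than anything taken from the question.

```php
<?php
// Mojibake from the Windows-1252 example: the characters â ¬ ¦ encoded as UTF-8.
$mojibake = "\xC3\xA2\xC2\xAC\xC2\xA6";

// Illustrative shortlist of single-byte encodings; adjust to your region.
$candidates = ['ISO-8859-1', 'Windows-1252', 'ISO-8859-15', 'Windows-1251'];

$survivors = [];
foreach ($candidates as $proxy) {
    // Reverse the second (wrong) encoding step.
    $bytes = mb_convert_encoding($mojibake, $proxy, 'UTF-8');
    // Keep the candidate only if the recovered bytes form valid UTF-8.
    if (mb_check_encoding($bytes, 'UTF-8')) {
        $survivors[$proxy] = $bytes;
    }
}

foreach ($survivors as $proxy => $bytes) {
    echo $proxy, ': ', bin2hex($bytes), PHP_EOL;
}
```

More than one candidate can survive, because encodings that agree on the bytes involved cannot be told apart this way. The filter narrows the search; it doesn't settle it.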