I have created a function to convert the following text to UTF-8, as it appeared to be in Windows-1252 format, due to being copied to a database table from a Word Document.

Testing weird character’s correction

This seems to fix the dodgy ’ character. However, I'm now getting � in the following:

Devon�s most prominent dealerships

When passing the following through the same function:

Devon's most prominent dealerships

Below is the code which does the converting:

function Windows1252ToUTF8($text) {
    return mb_convert_encoding($text, "Windows-1252", "UTF-8");
}

Edit: The database can't be changed due to holding thousands of custom records. I tried the below but the mb_detect_encoding thinks character’s correction is UTF-8.

function Windows1252ToUTF8($text) {
    if (mb_detect_encoding($text) == "UTF-8") {
        return $text;
    }
    return mb_convert_encoding($text, "Windows-1252", "UTF-8");
}

Edit 2: Just tried the example from the PHP Documentation:

$str = 'áéóú'; // ISO-8859-1
echo "<pre>";
var_dump(mb_detect_encoding($str, 'UTF-8')); // 'UTF-8'
var_dump(mb_detect_encoding($str, 'UTF-8', true)); // false
echo "</pre>";
die();

but this simply outputs:

string(5) "UTF-8" string(5) "UTF-8"

So I can't even detect the encoding of the string :S
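
One possible explanation, assuming the test file itself is saved as UTF-8: the literal 'áéóú' is then already a valid UTF-8 byte sequence, so both calls report UTF-8. A minimal sketch with explicit byte escapes (the variable names are illustrative) takes the file's own encoding out of the picture:

```php
<?php
// Explicit byte escapes, so the result does not depend on how this file is saved.
$latin1 = "\xe1\xe9\xf3\xfa";                 // "áéóú" encoded as ISO-8859-1
$utf8   = "\xc3\xa1\xc3\xa9\xc3\xb3\xc3\xba"; // "áéóú" encoded as UTF-8

var_dump(mb_check_encoding($latin1, 'UTF-8'));        // bool(false)
var_dump(mb_check_encoding($utf8, 'UTF-8'));          // bool(true)
var_dump(mb_detect_encoding($latin1, 'UTF-8', true)); // bool(false)
```

For a yes/no validity question, `mb_check_encoding` is probably the more direct tool; `mb_detect_encoding` only picks among candidate encodings and cannot tell a correct UTF-8 string from a double-encoded one, since both are valid UTF-8.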

Edit 3: This seems to do the trick:

function Windows1252ToUTF8($text) {
    $badChars = [ "â", "á", "ú", "é", "ó" ];
    $match = preg_match("/[".join("",$badChars)."]/", $text);
    if ($match) {
        return mb_convert_encoding($text, "Windows-1252", "UTF-8");
    }
    return $text;
}

Edit 4: I have matched the hex values to their corresponding values. However when I get to the weird characters they don't appear to match.

Martyn Ball
  • if you're trying to output from a database (which sounds like it to me), you need to pass UTF-8 in the connection before querying. Have you gone through this yet? https://stackoverflow.com/questions/279170/utf-8-all-the-way-through?rq=1 – Funk Forty Niner Feb 20 '18 at 12:22
  • If it's MSWord, it's probably a "smart" quote, i.e. the character `’`. This indeed results in � in the output. – Ynhockey Feb 20 '18 at 12:23
  • See updated post @FunkFortyNiner – Martyn Ball Feb 20 '18 at 12:25
  • *"The database can't be changed due to holding thousands of custom records."* - Adding UTF-8 to the connection parameter isn't "changing" a database, just the method of connecting to it. – Funk Forty Niner Feb 20 '18 at 12:26
  • @FunkFortyNiner, just checked the database connection and this has already been added. – Martyn Ball Feb 20 '18 at 12:30
  • what about the file's encoding? This will matter. Try changing to ANSI, then UTF-8 (with and without BOM); one of those should pan out. I had the same problem before. – Funk Forty Niner Feb 20 '18 at 12:33
  • I mean, I'm just testing this in one of the many PHP files which our websites run through, and the PHP file will be UTF-8. But the string from the database is not, which is the issue. In the past we have just done a load of string replacements, but it's a bit dirty that way. See updated post @FunkFortyNiner – Martyn Ball Feb 20 '18 at 12:39
  • I don't know how you're using that file to fetch/display, but this is what I had to do once when faced with a similar problem. `header ('Content-type: text/html; charset=iso8859-15');` on top, then `$file_x = "/path/to/file.xxx"; $file = file_get_contents("$file_x", FILE_USE_INCLUDE_PATH); $file = utf8_encode ( $file );` - I don't know if this will work for you. Did you try the file encoding comment I left earlier? – Funk Forty Niner Feb 20 '18 at 12:54
  • Hmm, couldn't seem to get it working. I have added my solution to my post, seems to work for now. – Martyn Ball Feb 20 '18 at 13:05
  • @MartynBall you should have posted that as an answer instead. – Funk Forty Niner Feb 20 '18 at 13:07
  • @FunkFortyNiner Good point, thanks for the help. Added the answer. – Martyn Ball Feb 20 '18 at 13:09
  • It's unclear whether the data is wrong, or you're interpreting the data incorrectly and that's the only reason it screws up. Show `echo bin2hex($theString)` to see what the actual bytes are, which allows us to judge how it needs to be converted if at all. – deceze Feb 20 '18 at 13:09
  • You're welcome @Martyn – Funk Forty Niner Feb 20 '18 at 13:09

2 Answers

Converting Testing weird character’s correction using bin2hex gives me 54657374696e6720776569726420636861726163746572c3a2e282ace284a27320636f7272656374696f6e

This means the "’" is actually the bytes \xc3\xa2\xe2\x82\xac\xe2\x84\xa2. This is a typical sign of a UTF-8 string having been interpreted as Windows Latin-1/1252, and then re-encoded to UTF-8.

(UTF-8 \xe2\x80\x99)
→ bytes interpreted as Windows-1252 equal the string ’
→ characters encoded to UTF-8 result in \xc3\xa2\xe2\x82\xac\xe2\x84\xa2
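
The chain above can be reproduced in a few lines. This is only a sketch of the suspected mis-step, with illustrative variable names; it does not show where in the real workflow the conversion happened:

```php
<?php
// U+2019 (right single quotation mark) as correct UTF-8.
$original = "\xe2\x80\x99";

// Mis-step: treat those UTF-8 bytes as Windows-1252 characters
// and encode the resulting three characters as UTF-8.
$mojibake = mb_convert_encoding($original, 'UTF-8', 'Windows-1252');

echo bin2hex($mojibake), "\n"; // c3a2e282ace284a2
```

The hex output matches the bytes in the middle of the `bin2hex` dump from the question.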

To restore the original, you need to reverse that chain of mis-encodings:

$s = "\xc3\xa2\xe2\x82\xac\xe2\x84\xa2";
echo mb_convert_encoding($s, 'Windows-1252', 'UTF-8');

This interprets the string as UTF-8 and converts it to the Windows-1252 equivalent, whose bytes (\xe2\x80\x99) are then the valid UTF-8 representation of ’.
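
If the table mixes damaged and correct rows, the reverse conversion should probably only be applied where it is safe. A minimal sketch, assuming one round of mis-encoding; `repairDoubleEncodedUtf8` is a hypothetical helper name, not an existing function:

```php
<?php
/**
 * Hypothetical helper: undo one round of "UTF-8 read as Windows-1252"
 * mojibake, but only when the result is valid UTF-8 and nothing is lost.
 */
function repairDoubleEncodedUtf8(string $text): string
{
    // Not valid UTF-8 at all: leave it alone rather than guess.
    if (!mb_check_encoding($text, 'UTF-8')) {
        return $text;
    }

    $candidate = mb_convert_encoding($text, 'Windows-1252', 'UTF-8');

    // Plain ASCII comes back unchanged, so there is nothing to repair.
    if ($candidate === $text) {
        return $text;
    }

    // A correctly encoded ’ turns into the lone byte 0x92 here,
    // which is not valid UTF-8, so the original is kept.
    if (!mb_check_encoding($candidate, 'UTF-8')) {
        return $text;
    }

    // Characters with no Windows-1252 equivalent get substituted;
    // reject the candidate if converting back does not restore the input.
    if (mb_convert_encoding($candidate, 'UTF-8', 'Windows-1252') !== $text) {
        return $text;
    }

    return $candidate;
}

echo bin2hex(repairDoubleEncodedUtf8("\xc3\xa2\xe2\x82\xac\xe2\x84\xa2")), "\n"; // e28099
echo bin2hex(repairDoubleEncodedUtf8("\xe2\x80\x99")), "\n";                     // e28099
```

The second call shows why an unconditional conversion can produce �: a string that already holds a correct ’ would be turned into the single byte 0x92, which is not valid UTF-8. A string that mixes damaged and correct characters is left untouched by this sketch, so it is a heuristic at best.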

Preferably you figure out at what point the encoding screwed up like this and you stop that from happening in the future. If it happened by "copy and pasting from Word", then basically somebody pasted garbage into your database and you need to fix the workflow with Word somehow. Otherwise there may be an incorrect encoding-conversion step somewhere in your code which you need to fix.

deceze
  • @Martyn I realise this basically is the same solution as you already found, but at least it provides a reason and gives you something to diagnose the root cause. – deceze Feb 20 '18 at 14:00
  • Thanks for the explanation, this is very useful. I will mark this as the answer due to the information you have given me. I believe the issue does just stem from Word copy and pasting, we are working on backend systems to avoid doing this. – Martyn Ball Feb 21 '18 at 10:05

The following seems to do the trick. Not the way I wanted it to work by checking for specific characters, but it does the trick.

function Windows1252ToUTF8($text) {
    $badChars = [ "â", "á", "ú", "é", "ó" ];
    $match = preg_match("/[".join("",$badChars)."]/", $text);
    if ($match) {
        return mb_convert_encoding($text, "Windows-1252", "UTF-8");
    }
    return $text;
}

Edit:

function Windows1252ToUTF8($text) {
    // http://www.fileformat.info/info/charset/UTF-8/list.htm
    $illegal_hex = [ "c3a2", "c3a1", "c3ba", "c3a9", "c3b3" ];
    $match = preg_match("/".join("|",$illegal_hex)."/", bin2hex($text));
    if ($match) {
        return mb_convert_encoding($text, "Windows-1252", "UTF-8");
    }
    return $text;
}

Martyn Ball
  • What exactly this code does depends on the encoding the .php file is saved with. It's also a bad idea to try to treat encoding issues on an individual character basis. You need to figure out where exactly your encoding goes sideways and why, not just whack-a-mole like this. – deceze Feb 20 '18 at 13:19
  • @deceze that's the issue I was trying to solve. It's down to differently encoded values being pasted into the table. – Martyn Ball Feb 20 '18 at 13:24
  • They can't be "differently encoded", at worst they're *mojibake*. Which means you need to reverse the incorrect transcoding that happened at some point (and preferably prevent such transcoding from happening in the future). To do that you need to know what values exactly you're working with, for which you need to look at the actual bytes. → https://stackoverflow.com/questions/48885180/converting-window-1252-to-utf-8-issue#comment84776714_48885180 – deceze Feb 20 '18 at 13:27
  • Converting `Testing weird character’s correction` using bin2hex gives me `54657374696e6720776569726420636861726163746572c3a2e282ace284a27320636f7272656374696f6e` – Martyn Ball Feb 20 '18 at 13:32
  • @deceze `e282`, `ace2`, `84a2` don't seem to correspond to a hex charset. – Martyn Ball Feb 20 '18 at 13:47