
I have a CSV file whose content is UTF-8 encoded. However, various applications and systems erroneously detect the encoding of the CSV as Windows-1252, which breaks all the special characters in the file (e.g. umlauts).

I can see that Sublime Text on Windows, for example, also automatically detects the wrong Windows-1252 encoding when opening the file for the first time, showing garbled text where special characters are supposed to be.

When I choose Reopen with Encoding » UTF-8, everything looks fine, as expected.

Now, to find the source of the error, I thought it might help to figure out why these applications are not automatically detecting the correct encoding in the first place. Maybe there is a stray character somewhere with the wrong encoding, for example.

The CSV in question is actually an automatically generated product export of a Magento 2 installation. Recently the character encoding broke and I am currently trying to figure out what happened - hence my investigation into why this export is detected as Windows-1252.

Is there any reliable way of figuring out why the automatic detection of applications like Sublime Text assumes the wrong character encoding?
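One way to narrow this down is to scan the file byte by byte and report where it stops being valid UTF-8. The following is a minimal sketch; the helper name `find_first_invalid_utf8_offset` is mine, not from any library, and it assumes the `mbstring` extension is available.

```php
<?php
// Sketch: return the byte offset at which a string stops being valid
// UTF-8, or null if the whole string is valid. Helper name is made up.

function find_first_invalid_utf8_offset(string $data): ?int
{
    $len = strlen($data);
    $offset = 0;
    while ($offset < $len) {
        $byte = ord($data[$offset]);
        if ($byte < 0x80) { $offset++; continue; }           // plain ASCII byte
        if (($byte & 0xE0) === 0xC0)     { $need = 2; }      // 2-byte sequence
        elseif (($byte & 0xF0) === 0xE0) { $need = 3; }      // 3-byte sequence
        elseif (($byte & 0xF8) === 0xF0) { $need = 4; }      // 4-byte sequence
        else                             { return $offset; } // invalid lead byte
        $seq = substr($data, $offset, $need);
        if (strlen($seq) < $need || !mb_check_encoding($seq, 'UTF-8')) {
            return $offset;                                  // truncated or invalid
        }
        $offset += $need;
    }
    return null; // entirely valid UTF-8
}

// Usage (file name assumed):
// $csv = file_get_contents('export.csv');
// var_dump(find_first_invalid_utf8_offset($csv));
```

If the function returns an offset, dumping a few bytes around it (e.g. with `bin2hex(substr($csv, $offset - 10, 20))`) usually reveals the offending character.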

fritzmg
  • Does this happen on Windows? – lenz Feb 11 '19 at 12:51
  • The export is generated under a Linux system; I opened the file with Sublime Text under Windows. Sublime Text does not open the file with UTF-8 on macOS either. According to https://forum.sublimetext.com/t/how-does-sublime-detect-file-encodings/16194/5 Sublime Text will try to determine if a file is valid UTF-8 - and if it is not, `Windows-1252` (`cp-1252`?) will be the fallback. And I'd like to know _why_ this particular file fails to be detected as valid UTF-8. – fritzmg Feb 11 '19 at 12:54
  • Although UTF-8 doesn't usually require a BOM, adding one might help these hapless Windows utilities detect the encoding correctly. How exactly to do that with a Magento export is beyond me, but if you can add a manual step in between, just prepend the three bytes before copying the file over. – tripleee Feb 11 '19 at 12:57
  • Could you check if you didn't mix both encoding somewhere in the file? Try with an other editor. – Giacomo Catenazzi Feb 11 '19 at 14:16
  • Unreliability follows from detection. Just tell the program what you know. – Tom Blodget Feb 12 '19 at 18:03
  • Maybe you can try the command suggester here: https://stackoverflow.com/a/41741313/2440340 to double check if your file is actually a valid UTF-8 encoded text file. – Antoine Mottier Feb 12 '19 at 20:33
  • If multiple detectors are failing to detect UTF-8, then the file is obviously NOT encoded in UTF-8, plain and simple. If it is SUPPOSED to be encoded in UTF-8, then it has an error in it. The only way to find out is to LOOK AT THE RAW DATA and work out which byte(s) are incorrect. UTF-8 is a very simple encoding to implement by hand. It would also help if you know what the original text is so you can know what the UTF-8 encoded form is supposed to look like, and then compare that to your file. – Remy Lebeau Feb 12 '19 at 23:36
  • @RemyLebeau yes, the file had mixed contents. The question was how to find out which characters were not UTF-8 encoded, because the majority were, and it wasn't easy to immediately identify the characters with the wrong encoding. I'll post an answer describing how I found out what was going on. – fritzmg Feb 14 '19 at 10:35

1 Answer


This is what I did in the end to find out why the file was not detected as UTF-8, i.e. to find the characters that were not encoded in UTF-8. Since PHP is most readily available to me, I decided to simply use the following script to force-convert anything that is not UTF-8 to UTF-8, using the very handy neitanod/forceutf8 library.

// neitanod/forceutf8, installed via Composer
require 'vendor/autoload.php';

$before = file_get_contents('export.csv');
$after = \ForceUTF8\Encoding::toUTF8($before);
file_put_contents('export.fixed.csv', $after);

Then I used a file comparison tool (Beyond Compare in my case) to compare the two resulting CSVs, which made it easy to see which characters were not originally encoded in UTF-8.
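If no GUI diff tool is at hand, a crude stand-in can be scripted in a few lines. This sketch (the function name `differing_lines` is mine) just reports the 1-based line numbers where the original and the force-converted file differ:

```php
<?php
// Sketch: return the 1-based line numbers at which two arrays of lines
// differ. A minimal substitute for a GUI diff tool.

function differing_lines(array $before, array $after): array
{
    $diffs = [];
    $max = max(count($before), count($after));
    for ($i = 0; $i < $max; $i++) {
        // Missing lines on either side also count as differences.
        if (($before[$i] ?? null) !== ($after[$i] ?? null)) {
            $diffs[] = $i + 1;
        }
    }
    return $diffs;
}

// Usage (file names follow the snippet above):
// $diffs = differing_lines(file('export.csv'), file('export.fixed.csv'));
// print_r($diffs);
```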

This in turn showed me that only one particular column of the export was affected. Upon further investigation I found out that the contents of that column were processed in PHP with the following preg_replace:

$value = preg_replace('/([^\pL0-9 -])+/', '', $value);

Using \pL in the regular expression without the u (Unicode) pattern modifier has a nasty side effect: the pattern is applied byte by byte rather than per code point, so individual bytes of multibyte UTF-8 characters can be stripped, leaving invalid byte sequences behind. The solution is to add the u flag to the regex (see regex pattern modifiers reference), which makes preg_replace treat both the pattern and the subject as UTF-8. See also this answer.
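The difference can be reproduced in isolation. This is a sketch under the assumption that PCRE is compiled with Unicode property support (the default in PHP); without /u the continuation bytes of "ö" and "ß" are stripped, corrupting the string:

```php
<?php
// Sketch: the same pattern with and without the /u modifier, applied to
// a valid UTF-8 string containing umlauts ("Größe").

$value = "Gr\xC3\xB6\xC3\x9Fe"; // "Größe" as explicit UTF-8 bytes

// Byte mode: \pL is checked per byte, so the continuation bytes
// 0xB6 and 0x9F are stripped and the remaining bytes are invalid UTF-8.
$broken = preg_replace('/([^\pL0-9 -])+/', '', $value);

// Unicode mode: the pattern operates on code points, umlauts survive.
$fixed = preg_replace('/([^\pL0-9 -])+/u', '', $value);

var_dump(mb_check_encoding($broken, 'UTF-8')); // false
var_dump(mb_check_encoding($fixed, 'UTF-8'));  // true
```

Exactly this kind of byte-level stripping is what made the export fail UTF-8 validation, so the detectors fell back to Windows-1252.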

fritzmg