List of known troublesome characters that cause PHP to fail to detect the proper character encoding before converting to UTF-8, resulting in lost data

Question

PHP isn't always correct, but what I write always has to be. In this case, an email subject contains an en dash character. This thread is about detecting oddball characters that, when alone (say, among otherwise purely ASCII text), are incorrectly detected by PHP. I've already determined one static example, though my goal here is to create a definitive thread containing code that is as close to drop-in as we can make it.

Here is my starting string from the subject header of an email:

<?php
//This is AFTER exploding the : of the header and using trim on $p[1]:
$s = '=?ISO-8859-1?Q?orkut=20=96=20convite=20enviado=20por=20Lais=20Piccirillo?=';
//orkut – convite enviado por Lais Piccirillo
?>

Typically the next step is to do the following:

$s = imap_mime_header_decode($s);//orkut � convite enviado por Lais Piccirillo

Typically past that point I'd do the following:

$s = mb_convert_encoding($s[0]->text, 'UTF-8', mb_detect_encoding($s[0]->text));//en dash missing!

Now I received a static answer for an earlier static question. Eventually I was able to put this working set of code together:

<?php
$s1 = '=?ISO-8859-1?Q?orkut=20=96=20convite=20enviado=20por=20Lais=20Piccirillo?=';

//Attempt to determine the character set:
$en = mb_detect_encoding($s1);//ASCII; wrong!!!
$p = explode('?', $s1, 3)[1];//ISO-8859-1; wrong!!!

//Necessary to decode the q-encoded header text any way FIRST:
$s2 = imap_mime_header_decode($s1);

//Now scan for character exceptions in the original text to compensate for PHP:
if (strpos($s1, '=96') !== false) {$s2 = mb_convert_encoding($s2[0]->text, 'UTF-8', 'CP1252');}
else {$s2 = mb_convert_encoding($s2[0]->text, 'UTF-8');}

//String is finally ready for client output:
echo '<pre>'.print_r($s2,1).'</pre>';//orkut – convite enviado por Lais Piccirillo
?>

Either I've still programmed this incorrectly and there is something in PHP I'm missing (I've tried many combinations of html_entity_decode, iconv, mb_convert_encoding and utf8_encode) or, at least for the moment with PHP 8, we'll be forced to detect specific characters and manually override the encoding as I've done on line 12. In the latter case, a bug report either needs to be created or, more likely, updated if one specific to this issue already exists.

So technically the question is:

How do we properly detect all character encodings to prevent any characters from being lost during the conversion of strings to UTF-8?

If no such proper answer exists, valid answers include characters that, when among otherwise purely ASCII text, result in PHP failing to detect the correct character encoding and thus producing an incorrect UTF-8 encoded string. Presuming this issue is fixed in the future and the fix can be validated against all oddball characters listed in the other relevant answers, a proper answer can then be accepted.

Just to be clear, `mb_detect_encoding` is correctly identifying that string as ASCII, that's the whole point of RFC1342, just like base64 can encode binary and you'd want `mb_detect_encoding` to detect that the encoded string is ASCII, too. Looking at [ISO-8859-1](https://www.ecma-international.org/publications-and-standards/standards/ecma-94/), 0x80 through 0x9f are all undefined, so generally the correct answer is to either omit it or use a replacement character. As your linked question stated, however, you could also guess/infer from other data. — Chris Haas, Nov 20 '21 at 13:04
I don't know how to do it in PHP, but CP1252 is a superset of 8859-1, and can be used instead in basically all cases, because many sources _lie_ about their encodings. In a similar python library, I just changed the codec map so that iso-8859-1/latin1 would always trigger cp1252 decoding. This is not actually a bug in PHP, it's a (common) bug in the sender. — Max, Nov 20 '21 at 14:29
IMHO it is _impossible_ to reliably _detect_ a string's encoding, at best you can only make an educated guess. String encoding is _metadata_ that you _must explicitly know_. — Sammitch, Nov 22 '21 at 19:25

IMSoP · Answer 1 · 2021-11-22T17:23:41.017

You are blaming PHP for something that PHP could not possibly solve:

$s1 is an ASCII string; just as the string "smiling face emoji" is ASCII, even though it describes the string "🙂".
$s2 is decoded according to the information you were sent. In fact, it's decoded into a raw sequence of bytes, and a label which was provided in the input.
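
A minimal sketch of those two points, using only core PHP and mbstring so that it runs without the imap extension. The manual Q-decoding via `quoted_printable_decode` is a stand-in for `imap_mime_header_decode` and assumes a single Q-encoded word; the variable names are illustrative.

```php
<?php
$s1 = '=?ISO-8859-1?Q?orkut=20=96=20convite=20enviado=20por=20Lais=20Piccirillo?=';

// The encoded word itself contains only 7-bit characters, so "ASCII" is the right answer for it.
$is_ascii = mb_check_encoding($s1, 'ASCII');

// Split "=?charset?encoding?payload?=" into its parts.
[, $label, $transfer, $payload] = explode('?', $s1);

// Q-encoding is close to quoted-printable, except that "_" stands for a space.
$raw_bytes = quoted_printable_decode(str_replace('_', ' ', $payload));

var_dump($is_ascii);                // bool(true)
echo $label, "\n";                  // ISO-8859-1 (a label supplied by the sender, not detected)
echo bin2hex($raw_bytes[6]), "\n";  // 96 (the raw byte that later has to be interpreted)
```

The byte 0x96 only appears after decoding, and at that point the only information about its meaning is the label the sender chose.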

Your actual problem is that the information you were sent was wrong - the system that sent it to you has made the common mistake of mislabelling Windows-1252 as ISO-8859-1.

Those two encodings agree on the meanings of 224 out of the 256 possible 8-bit values. They disagree on the values from 0x80 to 0x9F: those are control characters in ISO 8859 and (mostly) assigned to printable characters in Windows-1252.
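
As a rough check of that claim, the sketch below converts every byte value outside 0x80 to 0x9F under both labels and counts how many produce identical UTF-8, then prints a few bytes from the disputed range. It assumes mbstring is available and that it accepts the names `ISO-8859-1` and `Windows-1252`.

```php
<?php
// Count the byte values outside 0x80..0x9F that convert identically under both labels.
$agree = 0;
foreach (array_merge(range(0x00, 0x7F), range(0xA0, 0xFF)) as $byte) {
    $latin1 = mb_convert_encoding(chr($byte), 'UTF-8', 'ISO-8859-1');
    $cp1252 = mb_convert_encoding(chr($byte), 'UTF-8', 'Windows-1252');
    if ($latin1 === $cp1252) {
        $agree++;
    }
}
echo $agree, "\n"; // 224

// A few bytes from the range where the two encodings disagree.
foreach ([0x80, 0x93, 0x96] as $byte) {
    $latin1 = mb_convert_encoding(chr($byte), 'UTF-8', 'ISO-8859-1');
    $cp1252 = mb_convert_encoding(chr($byte), 'UTF-8', 'Windows-1252');
    printf(
        "0x%02X  ISO-8859-1 => %s  Windows-1252 => %s (%s)\n",
        $byte,
        bin2hex($latin1),
        bin2hex($cp1252),
        $cp1252
    );
}
```

For 0x96, the ISO-8859-1 reading is the control code U+0096, while the Windows-1252 reading is the en dash U+2013.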

Note that there is no way for any system to automatically tell you which interpretation was intended - either way, there is simply a byte in memory containing (for instance) 0x96. However, the extra control characters from ISO 8859 are very rarely used, so if the string claims to be ISO-8859-1 but contains bytes in that range, it's almost certainly in some other encoding. Since Windows-1252 is very widely used (and often mislabelled in this way), a common solution is simply to assume that any data labelled ISO-8859-1 is actually Windows-1252.
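
If you would rather override the label only when there is evidence of mislabelling, one possible heuristic is to look for bytes in the 0x80 to 0x9F range. The function name `looks_mislabelled` is hypothetical, and this is a sketch of the heuristic described above, not a general encoding detector.

```php
<?php
/**
 * Hypothetical helper: returns true when $bytes is labelled ISO-8859-1
 * but contains a byte in 0x80..0x9F, which ISO 8859 reserves for control codes.
 */
function looks_mislabelled(string $bytes, string $label): bool
{
    return strcasecmp($label, 'ISO-8859-1') === 0
        && preg_match('/[\x80-\x9F]/', $bytes) === 1;
}

var_dump(looks_mislabelled("orkut \x96 convite", 'ISO-8859-1')); // bool(true)
var_dump(looks_mislabelled("caf\xE9", 'ISO-8859-1'));            // bool(false)
```

Because the two encodings agree everywhere else, always treating ISO-8859-1 as Windows-1252 gives the same result for any string that this check lets through.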

That makes the solution really very simple:

// $input is the ASCII string you've received
$input = '=?ISO-8859-1?Q?orkut=20=96=20convite=20enviado=20por=20Lais=20Piccirillo?=';

// Decode the string into its labelled encoding, and string of bytes
$mime_decoded = imap_mime_header_decode($input);
$input_encoding = $mime_decoded[0]->charset;
$raw_bytes = $mime_decoded[0]->text;

// If it claims to be ISO-8859-1, assume it's lying
if ( $input_encoding === 'ISO-8859-1' ) {
    $input_encoding = 'Windows-1252';
}

// Now convert from a known encoding to UTF-8 for the use of your application
$utf8_string = mb_convert_encoding($raw_bytes, 'UTF-8', $input_encoding);
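
A header can decode into more than one part, so the same idea can be wrapped in a small function that handles every element. This is a sketch under stated assumptions: `header_parts_to_utf8` is a hypothetical name, the hand-built `$parts` array stands in for the return value of `imap_mime_header_decode` so that the example runs without ext-imap, and treating the `default` charset as ASCII is my assumption, not something the answer specifies.

```php
<?php
/**
 * Hypothetical helper: converts decoded header parts to one UTF-8 string.
 * Each part is expected to have ->charset and ->text, like the objects
 * returned by imap_mime_header_decode().
 */
function header_parts_to_utf8(array $parts): string
{
    $out = '';
    foreach ($parts as $part) {
        $charset = strtoupper($part->charset);
        if ($charset === 'ISO-8859-1') {
            // Following the answer: assume this label really means Windows-1252.
            $charset = 'Windows-1252';
        } elseif ($charset === 'DEFAULT') {
            // Assumption: unencoded parts are reported as "default"; treat them as ASCII.
            $charset = 'ASCII';
        }
        $out .= mb_convert_encoding($part->text, 'UTF-8', $charset);
    }
    return $out;
}

// Stand-in for imap_mime_header_decode($input).
$parts = [
    (object) ['charset' => 'default', 'text' => 'Fwd: '],
    (object) ['charset' => 'ISO-8859-1', 'text' => "orkut \x96 convite"],
];
echo header_parts_to_utf8($parts), "\n"; // Fwd: orkut – convite
```

With the imap extension loaded, the call would be `header_parts_to_utf8(imap_mime_header_decode($input))`. In PHP 8, `mb_convert_encoding` throws a `ValueError` for a label it does not recognise, so real input probably needs a fallback for unknown charsets.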

Do all characters covered by `ISO-8859-1` fit cleanly in to `Windows-1252` then or is there a reverse where not all `ISO-8859-1` characters fit cleanly in to `Windows-1252`? — John, Nov 22 '21 at 16:58
@John It's not really a case of "fitting in": both encodings give a meaning to all 256 values which can be represented in 8 bits. The important thing is that _they both give the same meaning_ to 224 of those values; the other 32 are the ones I mention: ISO-8859-1 gives them a meaning which is almost never used. If you see any value in the range 0x80 to 0x9F it's almost certainly _not_ intended to be read as one of those control codes; it _could_ be intended to be read according to one of many different encodings, but _the most likely_ is Windows-1252, simply because of how commonly used it is. — IMSoP, Nov 22 '21 at 17:20
