UTF-8 encoding of UTF-8 encoded text is not the same as original UTF-8 encoded text

Question

Here is a PHP code snippet I came up with when I found a bug in my project.

print(($str == utf8_encode($str) ? "the same text" : "not the same text") . PHP_EOL);
print(mb_detect_encoding($str));

This tells me whether a string $str is identical to the result of utf8_encode($str), and then prints the encoding detected for the original string.

What I expected is that either the UTF-8 text is the same as the original, or that the original text is already UTF-8 and therefore the UTF-8 encoded text is the same as the original.

But what really happened is the following output:

not the same text
UTF-8

This is only the case if I set $str = array_keys($_POST)[0]; and use a key with special characters in my request body, like äöü=test, so that $str will be äöü (defining it directly in the code does not produce the same output).

I interpret from the output that the original character encoding is UTF-8, but the two strings are not the same. If I print the initial string, the output is empty, while the encoded string prints as äöü.

I don't understand how a string can be different when encoded with its own encoding. Can someone please explain this to me?

I'd guess that it has mostly to do with default encoding as `print(utf8_encode($str));` returns `Ã¤Ã¶Ã¼`. — JosefZ, Dec 26 '20 at 15:28

score 2 · Answer 1 · answered Dec 26 '20 at 14:37

The problem is your assumption that "the original text is already UTF-8 and therefore the UTF-8 encoded text is the same as the original".

From the PHP Official Documentation regarding utf8_encode (https://www.php.net/manual/en/function.utf8-encode.php):

This function converts the string data from the ISO-8859-1 encoding to UTF-8.

In other words, this function is an ISO-8859-1 to UTF-8 converter. It treats every input byte as one ISO-8859-1 character, so it only works properly on an ISO-8859-1 string. If you pass it a string in any other encoding, including UTF-8, you should expect garbage.
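To make the effect concrete, here is a minimal sketch of what happens when an already UTF-8 encoded äöü goes through an ISO-8859-1 to UTF-8 conversion. It assumes the mbstring extension is available and uses mb_convert_encoding() as a stand-in for utf8_encode(), since utf8_encode() is deprecated as of PHP 8.2. The variable names are only illustrative.

```php
<?php
// "äöü" written as explicit UTF-8 bytes, so the result does not depend on
// how this source file itself is saved.
$utf8 = "\xC3\xA4\xC3\xB6\xC3\xBC";

// Stand-in for utf8_encode($utf8): every input byte is read as one
// ISO-8859-1 character and written out again as UTF-8.
$double = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');

echo bin2hex($utf8), PHP_EOL;    // c3a4c3b6c3bc
echo bin2hex($double), PHP_EOL;  // c383c2a4c383c2b6c383c2bc
var_dump($utf8 === $double);     // bool(false)

// Converting in the opposite direction undoes the double encoding.
$restored = mb_convert_encoding($double, 'ISO-8859-1', 'UTF-8');
var_dump($restored === $utf8);   // bool(true)
```

Each of the six original bytes becomes a two-byte sequence, which is why the result displays as `Ã¤Ã¶Ã¼` and why the comparison in the question reports "not the same text".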

This thread (PHP: Convert any string to UTF-8 without knowing the original character set, or at least try) discusses converting a string from any character encoding to UTF-8.

Hope it helps.

answered Dec 26 '20 at 14:37

Jefferson Lopes


Well that makes sense, thank you for your answer. Nevertheless, my problem was that if I printed `$str` before re-encoding it, the output was an empty string. I forgot to mention that in my question (I updated it). – täm Dec 26 '20 at 14:43
@täm there are many reasons for that; you should `print_r()` or `var_dump()` the original text to show the exact data. – Martin Dec 26 '20 at 15:15
@Martin var_dump of äöü returns `string(3) ""` – täm Dec 26 '20 at 19:19
@täm really? Or is it just what you "see"? Or how the browser interprets it? Call your website and save the response straight to a file (i.e. "save link as"). Then view it in a hex editor and tell us which bytes you actually see between the quotes. – AmigoJack Dec 28 '20 at 01:26
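As a rough sketch of the byte-level check suggested in the comments, the bytes can also be inspected from within PHP instead of a hex editor. The helper name describe_bytes() is hypothetical, and the two sample strings are assumptions chosen for illustration (äöü in ISO-8859-1 and in UTF-8); they are not the actual request data from the question. mb_check_encoding() requires the mbstring extension.

```php
<?php
// Hypothetical helper: report the byte length, the raw bytes in hex,
// and whether the bytes form valid UTF-8.
function describe_bytes(string $s): string
{
    return sprintf(
        '%d bytes, hex=%s, valid UTF-8=%s',
        strlen($s),                              // strlen() counts bytes, not characters
        bin2hex($s),
        mb_check_encoding($s, 'UTF-8') ? 'yes' : 'no'
    );
}

$latin1 = "\xE4\xF6\xFC";              // "äöü" as ISO-8859-1: one byte per character
$utf8   = "\xC3\xA4\xC3\xB6\xC3\xBC";  // "äöü" as UTF-8: two bytes per character

echo describe_bytes($latin1), PHP_EOL; // 3 bytes, hex=e4f6fc, valid UTF-8=no
echo describe_bytes($utf8), PHP_EOL;   // 6 bytes, hex=c3a4c3b6c3bc, valid UTF-8=yes
```

Applied to the real $str, output like this would show which bytes actually arrived, independent of how a browser or terminal renders them.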
