I'm hearing that the single Unicode character ℃ (U+2103, UTF-8 hex E28483) is turning into the two Unicode characters °C (UTF-8 hex C2B0 43). Let's verify this. If the two-character version is encoded as latin1 instead, its hex will be B0 43.
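As a quick cross-check of the hex values quoted above, here is a small Python sketch (Python is used only as a neutral scratchpad; the variable names are illustrative and not from the question). It encodes both forms and prints the bytes you would expect to see from HEX() or bin2hex().

```python
# Code points are written as escapes so the encoding of this source file cannot interfere.
one_char = "\u2103"    # the single character DEGREE CELSIUS
two_char = "\u00b0C"   # DEGREE SIGN followed by the letter C

utf8_one = one_char.encode("utf-8").hex()
utf8_two = two_char.encode("utf-8").hex()
latin1_two = two_char.encode("latin-1").hex()

print(utf8_one)    # e28483
print(utf8_two)    # c2b043
print(latin1_two)  # b043
```

If the hex you get from the database or from PHP matches none of these three, that is worth reporting too.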
If the character(s) are in the database, then do:
SELECT col, HEX(col) FROM ...
If they are in PHP, then do:
echo bin2hex($str);
Then report back which hex you get.
This discusses why the two-character version could turn into ?C. It suggests:
- The bytes to be stored are not encoded as utf8/utf8mb4. In particular, hex B0 is the latin1 encoding for °.
- The column in the database is not CHARACTER SET utf8 (or utf8mb4). Fix this. (Use SHOW CREATE TABLE.)
- Also, check that the connection during reading is UTF-8.
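To illustrate the first bullet, here is a hedged Python sketch of what happens when the latin1 bytes B0 43 are read by something that expects UTF-8. This is only an analogy: Python substitutes U+FFFD where MySQL typically shows a question mark, and MySQL's exact behavior depends on the column and connection settings. The variable names are made up for the example.

```python
latin1_bytes = bytes.fromhex("b043")  # "°C" as encoded in latin1

# Read with the correct charset, the text comes back intact.
as_latin1 = latin1_bytes.decode("latin-1")

# Read strictly as UTF-8, it fails: a lone B0 is a continuation byte,
# not a valid UTF-8 sequence.
try:
    latin1_bytes.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# Read leniently as UTF-8, the bad byte is replaced and only the "C" survives.
lenient = latin1_bytes.decode("utf-8", errors="replace")

print(as_latin1 == "\u00b0C")  # True
print(strict_ok)               # False
print(lenient == "\ufffdC")    # True
```

The surviving C after a substitute character is the same shape as the ?C symptom.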
Meanwhile, there is nothing (that I know of) in either MySQL or PHP that would turn the one-char encoding into the two-char version. Is there any other process involved?
In the Unicode specification, there is a compatibility "Decomposition" of the 1-char version into the 2-char version, but I don't know what product would make use of such. Another example: ǈ (U+01C8) vs Lj
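One candidate worth ruling out is Unicode compatibility normalization (NFKC or NFKD), which applies exactly this decomposition. The Python sketch below uses the standard unicodedata module to show the effect. Whether anything in your pipeline actually normalizes text this way is an assumption to check, not something I can confirm.

```python
import unicodedata

one_char = "\u2103"  # the single character DEGREE CELSIUS

# The Unicode Character Database records a compatibility mapping for it.
decomp = unicodedata.decomposition(one_char)

# Canonical normalization (NFC) leaves the character alone...
nfc = unicodedata.normalize("NFC", one_char)
# ...but compatibility normalization (NFKC) splits it into two characters.
nfkc = unicodedata.normalize("NFKC", one_char)

# The same thing happens to U+01C8, the single-character "Lj".
lj = unicodedata.normalize("NFKC", "\u01c8")

print(decomp)                  # <compat> 00B0 0043
print(nfc == one_char)         # True
print(nfkc == "\u00b0C")       # True
print(lj == "Lj")              # True
```

Search forms, slug generators, and some import/export tools normalize this way, so those are places to look.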
Who is converting?
If MySQL were converting from utf8 to latin1, I would expect
CONVERT(CONVERT(BINARY('℃') USING utf8) USING latin1)
to return the two-char version. But, no, it returns '?'. I have to assume that some other process the data is going through is being kind enough to convert the 1-char thing into 2 chars, perhaps then converting to latin1 (which is almost identical to cp1252, ISO-8859-1, and ISO-8859-15).
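The same reasoning can be sketched outside MySQL. In the Python below, a direct conversion to latin1 loses the character, much as the CONVERT expression does, while a hypothetical two-step process (compatibility normalization first, then latin1) produces the bytes B0 43. The two-step pipeline is my guess at what some other process might be doing. Python's codec stands in for MySQL's converter here and is not the same code.

```python
import unicodedata

one_char = "\u2103"  # the single character DEGREE CELSIUS

# Direct conversion: latin1 has no such character, so it is replaced by "?".
direct = one_char.encode("latin-1", errors="replace")

# Hypothetical two-step process: split into two characters, then encode as latin1.
two_step = unicodedata.normalize("NFKC", one_char).encode("latin-1")

print(direct.hex())    # 3f   (a lone "?")
print(two_step.hex())  # b043
```

So getting B0 43 out of a ℃ needs the extra normalization step; a plain charset conversion alone would not produce it.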