0

I want to select a string containing a micro character (µ) and a superscript from a table and insert it into another table. Both the source field and the destination field are formatted in utf8mb4_unicode_ci (not the whole table, just the fields)

+-----------------+-------------+------+-----+---------+-------+
| Field           | Type        | Null | Key | Default | Extra |
+-----------------+-------------+------+-----+---------+-------+
| type_name       | varchar(30) | NO   | PRI | NULL    |       |
| unit_of_measure | varchar(5)  | NO   |     | NULL    |       |
+-----------------+-------------+------+-----+---------+-------+

I tried using this command

INSERT INTO dest_table (type, unit_of_measure)
SELECT DISTINCT type, unit_of_measure FROM source_table;

to just insert them by selecting them, but: a SELECT from the source table gives me the correct result

SELECT unit_of_measure FROM source_table LIMIT 1

Result: µg/m³

And a select from the destination table gives me a "stripped"(?) result:

SELECT unit_of_measure FROM dest_table LIMIT 1

Result: µg/m (expected: µg/m³) OR a result with an unknown character:
Result: ng/m� (expected: ng/m³)

This is the dataset I'm using, the column I need is the "UnitaMisura" one. I'm using PDO with a local MySQL connection.

Any help is appreciated.

Edit 1:

SELECT DISTINCT unit_of_measure, HEX(unit_of_measure) FROM source_table

Result: C382C2B5672F6DC382C2B3 (µg/m³)

SELECT DISTINCT unit_of_measure, HEX(unit_of_measure) FROM dest_table

Result: C382C2B5672F6D (µg/m)

FIX:
I had to set the PDO connection charset, by adding charset=utf8mb4 at the end of the connection string.
Example: new PDO("mysql:host=$db_host;dbname=$db_name;charset=utf8mb4", $db_user, $db_password)

Knocks
  • 11
  • 1
  • 5
  • Please provide `SELECT DISTINCT unit_of_measure, HEX(unit_of_measure) FROM ...` I may help to see what was actually stored in the table. – Rick James Jun 07 '20 at 21:18
  • @RickJames correct: `C382C2B5672F6DC382C2B3`, wrong: `C382C2B5672F6D` – Knocks Jun 08 '20 at 07:11
  • To find the exact problem spot, you have to create an isolated working example. It means an SQL dump, that involves everything: two tables creation, inserting data into the first table, selecting data and insert-select into the second table. During the process, you will spot the actual issue. Or, in the unlikely event you won't, you will have an op topic answerable question for stack overflow. – Your Common Sense Jun 08 '20 at 07:36
  • @YourCommonSense - Yes, such a test case would help. However, the cause seems to be in the configuration of the table or connection or client. The HEX that the OP provided narrowed it down to one of about 5 different scenarios. Since there are about 4 places there UTF-8 might be missed, the "UTF-8 all the stay thru" is too broad (in my opinion). – Rick James Jun 08 '20 at 16:31

1 Answers1

0

Yes, "utf-8 all the way through" is a valid discussion. But this case goes beyond that.

It is "double-encoding" performed twice. See Trouble with UTF-8 characters; what I see is not what I stored for cause and discussion of "double encoding".

Somehow, your code fed the data through a second time

µ    g  /  m  ³
B5   67 2F 6D B3      (properly encoded in latin1)
C2B5 67 2F 6D C2B3    (properly encoded in utf8 or utf8mb4)
 µ  g  /  m   ³     (misinterpreting as latin1)

Then, feeding through again, C2 becomes C382 and B5 becomes C2B5, etc:

C382 C2B5 67 2F 6D C382 C2B3  (treating the utf8 as latin1 and converting again)

It would be possible to unravel the mess you now have, but it would be better to fix the code:

  • The bytes to be stored need to be UTF-8-encoded. Fix this.
  • The connection when INSERTing and SELECTing text needs to specify utf8 or utf8mb4. Fix this.
  • The column needs to be declared CHARACTER SET utf8 (or utf8mb4). Fix this.
  • HTML should start with <meta charset=UTF-8>.

Note: The above discussion deals with the incorrect encoding having been INSERTed into the table. There may also be issues on the SELECT side.

Rick James
  • 135,179
  • 13
  • 127
  • 222
  • Just updated my answer, my problem was that I didn't specify the charset in my PDO object. I've added the fix in the answer. – Knocks Jun 09 '20 at 08:26