0

I have a table in my database where some rows are binary strings, and other are wrongly encoded :

id 1 = b"éhhh

id 2 = éhhh

If I try to convert the column

UPDATE my_table SET my_column = 
    CONVERT(BINARY CONVERT(my_column USING latin1) USING utf8)
    WHERE LENGTH(my_column) != CHAR_LENGTH(my_column);

But when I do that all my binary values disapear :

id 1 = (empty)

id 2 = éhhh

Is there a way to encode all the table without losing any value ?

Here is the create statement of my table :

CREATE TABLE `my_table` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `area_id` int(11) NOT NULL DEFAULT '0',
  `my_column` varchar(25) NOT NULL DEFAULT '',
  `description` varchar(60) DEFAULT NULL,
  `capacity` int(11) NOT NULL DEFAULT '0',
  `admin_email` mediumtext,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 ROW_FORMAT=DYNAMIC;

Here are 2 exemple of HEX

église St Alban de R. = C383C2A9676C69736520537420416C62616E20646520522E

église St Jean-Baptiste = C3A9676C697365205374204A65616E2D4261707469737465

jiboulex
  • 2,963
  • 2
  • 18
  • 28
  • 1
    *where some rows are binary strings, and other are wrongly encoded* Add HEX(id) for these values. Also provide CREATE TABLE script which allows to see the datatype, charset and collation for this column. – Akina Jul 07 '22 at 10:45
  • Thank you for your comment, I edited my question with the requested informations – jiboulex Jul 07 '22 at 13:22
  • Your values are successfully inserted and retrieved. So what is the issue? https://dbfiddle.uk/?rdbms=mysql_8.0&fiddle=80877c19d599064d2b0f1c222cbdafbd (MySQL 8 is used, so UTF8 is by fact UTF8MB3). – Akina Jul 07 '22 at 13:27
  • my issue is that i retrieve é instead of é sometimes, and a binary string with é other times. I would like to have the same encoding for every rows. – jiboulex Jul 07 '22 at 13:31
  • Also my research wont work because I'm searching field with é in the value and the values with é won't appear in the results – jiboulex Jul 07 '22 at 13:31
  • You must set correct charset to this column (UTF8MB4 if supported, UTF8MB3 otherwise) and set needed default collation (or specify it explicitly in the condition expression). Also check server and connection charset and collation. – Akina Jul 07 '22 at 13:55
  • Caution: the data currently has 2 symptoms. If you fix the settings, you could end up with a 3rd type of data! Fix the data _before_ fixing the settings. – Rick James Jul 07 '22 at 17:10

1 Answers1

0

You wanted éhhh for the second one? That's called "Mojibake". I can't make out what you wanted for the first one.

C383C2A9676C697365 implies "double-encoding" for église

See both Mojibake and double encoding in Trouble with UTF-8 characters; what I see is not what I stored for a discussion of what is probably causing the issue.

If you can start over on the table -- this time getting the connection settings and table definitions all referring to utf8mb4 (or at least utf8 (currently aka utf8mb3)), that would be best.

If that is not an option, then read on...

One of the Comments implies that some rows have double encoding, while others are just Mojibake. That complicates the "fix". Fixing Mojibake will [I think] turn the double-encoded strings into just Mojibake. Then the task will be to discover which rows to "fix" again.

Rick James
  • 135,179
  • 13
  • 127
  • 222