Selecting strings from a column where some are utf8 encoded and others are not

Question

The data is the names of country subdivisions. Some have been stored as utf8 and some have not. For example, this is how they appear in my table:

statename

Bocas del Toro
ChiriquÃ
CoclÃ©
Colón
Darién
Veraguas
Panamá Oeste
EmberÃ¡
Kuna Yala
Ngöbe-Buglé

This question/answer gets me really close to a solution: How to fix double-encoded UTF8 characters (in an utf-8 table)

If I use CONVERT(CAST(CONVERT(statename USING latin1) AS BINARY) USING utf8), I get:

statename

Bocas del Toro
Chiriquí
Coclé
Col
Dari
Veraguas
Panam
Emberá
Kuna Yala
Ng

Values where the character is stored correctly, as "é" for example, are cut off at that character.
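
The truncated output is consistent with the round trip stopping at the first byte that is not valid UTF-8. A correctly stored "ó" becomes the single latin1 byte 0xF3, which cannot be decoded as UTF-8 when an ASCII letter follows it. The Python sketch below only illustrates that byte-level behaviour and is not MySQL's implementation. The function name `roundtrip_truncating` is hypothetical, and the sketch assumes that MySQL's latin1 behaves like cp1252 for these characters.

```python
def roundtrip_truncating(text: str) -> str:
    """Emulate the latin1 -> binary -> utf8 round trip from the question,
    cutting the result at the first byte that is not valid UTF-8
    (the behaviour the asker observed)."""
    # CONVERT(... USING latin1) followed by CAST(... AS BINARY)
    raw = text.encode("cp1252")
    try:
        # CONVERT(... USING utf8)
        return raw.decode("utf-8")
    except UnicodeDecodeError as err:
        # Keep only the bytes before the first invalid one.
        return raw[:err.start].decode("utf-8")


# Double-encoded value: "é" was stored as the two characters "Ã©".
print(roundtrip_truncating("CoclÃ©"))  # Coclé
# Correctly stored value: the lone byte 0xF3 is not valid UTF-8 here.
print(roundtrip_truncating("Colón"))   # Col
```

The sketch does not handle characters that have no cp1252 encoding; those would raise `UnicodeEncodeError` in Python.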

The variation provided in that answer,

SELECT CASE
    WHEN CONVERT( CAST( CONVERT( statename USING latin1 ) AS BINARY ) USING utf8 ) IS NULL
        THEN statename
    ELSE CONVERT( CAST( CONVERT( statename USING latin1 ) AS BINARY ) USING utf8 )
END
FROM

returned the same result, though I am not even sure I implemented it correctly in this SELECT.

I am not permitted to normalize this data in this case, so I would like to select it and get

Bocas del Toro
Chiriquí
Coclé
Colón
Darién
Veraguas
Panamá Oeste
Emberá
Kuna Yala
Ngöbe-Buglé

Will this be possible?

Are you sure the (latin) strings are not already truncated in the table? — Paul Spiegel, Mar 12 '19 at 17:42
yes, i do have access to the table and they appear like "Ngöbe-Buglé" there, as an example. — M Montgomery, Mar 12 '19 at 18:11
What is your server version? It seems to work in 5.7 but not in 5.6 - https://www.db-fiddle.com/f/2zp8NAior1JKo3HzbmYoei/0 — Paul Spiegel, Mar 12 '19 at 18:35
Try to change the sql_mode before you execute your statement: `set session sql_mode = concat('STRICT_TRANS_TABLES,', @@sql_mode);` — Paul Spiegel, Mar 12 '19 at 18:41
OK, maybe you fixed it today. But until you are consistent on the input _and_ change the column to be utf8, you may continue to get Mojibake or truncation. — Rick James, Mar 13 '19 at 18:45
I believe I understand what you're saying, Mr James, but like I said I'm not permitted to edit the data or the table structure in any way. It's not mine to tinker with I'm afraid. I greatly appreciate the assistance, it was very practical and helpful. I will get a lot of use out of it. — M Montgomery, Mar 14 '19 at 19:09

score 0 · Accepted Answer · answered Mar 12 '19 at 19:24

This seems to be an issue with the SQL_MODE. For the conversion to fail and return NULL, STRICT_TRANS_TABLES mode must be set. You can set it with:

SET SESSION sql_mode = CONCAT('STRICT_TRANS_TABLES,', @@sql_mode);

If you don't want to break other "working" queries in the same session, you should reset it after you've got the results:

SET @old_sql_mode = @@sql_mode;
SET SESSION sql_mode = CONCAT('STRICT_TRANS_TABLES,', @@sql_mode);

SELECT COALESCE(
  CONVERT( CAST( CONVERT( statename USING latin1 ) AS BINARY ) USING utf8 ), statename
) as statename
FROM yourTable;

SET SESSION sql_mode = @old_sql_mode;

DB Fiddle demo

Note: I have changed your query a bit to use COALESCE() instead of the CASE statement, so you don't need to duplicate the conversion code.
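
The per-row logic of the query above can be sketched outside the database. This is a rough Python illustration under the same assumption that latin1 behaves like cp1252 for these values. The helper names `strict_roundtrip` and `coalesce` are hypothetical. A failed conversion yields `None`, standing in for the NULL that strict mode produces, and the original value is used as the fallback.

```python
from typing import Optional


def strict_roundtrip(text: str) -> Optional[str]:
    """Return the repaired string, or None when the latin1 bytes are not
    valid UTF-8 (standing in for the NULL result in strict mode)."""
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return None


def coalesce(*values):
    """Return the first argument that is not None, like SQL COALESCE()."""
    return next((v for v in values if v is not None), None)


rows = ["Bocas del Toro", "CoclÃ©", "Colón", "EmberÃ¡", "Ngöbe-Buglé"]
fixed = [coalesce(strict_roundtrip(name), name) for name in rows]
print(fixed)
# ['Bocas del Toro', 'Coclé', 'Colón', 'Emberá', 'Ngöbe-Buglé']
```

One limitation of this approach: a correctly stored value whose latin1 bytes happen to form valid UTF-8 would be converted even though it was not double-encoded.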
