1

I've got a MySQL database with latin1 encoding, and I'm struggling with function SUBSTRING() which is obviously counting bytes and not characters, as shown by the following scenario:

MySQL [hozana]> set names utf8;
Query OK, 0 rows affected (0.00 sec)

MySQL [hozana]> SELECT SUBSTRING('ééééé', 1, 3);
+-------------------------------+
| SUBSTRING('ééééé', 1, 3)      |
+-------------------------------+
| ééé                           |
+-------------------------------+

Everything normal up to now, let's switch the connection to latin1 encoding.

MySQL [hozana]> set names latin1;
Query OK, 0 rows affected (0.00 sec)

MySQL [hozana]> SELECT SUBSTRING('ééééé', 1, 3);
+-------------------------------+
| SUBSTRING('ééééé', 1, 3)      |
+-------------------------------+
| é�                             |
+-------------------------------+

The only way I found right now, is to convert string to utf-8 before function SUBSTRING() and convert it back to latin1 afterwards. Which is very ugly...

MySQL [hozana]> select convert(cast(convert(substring(convert(cast(convert('éééé' using  latin1) as binary) using utf8), 1, 3) using utf8) as binary) using latin1);
+--------------------------------------------------------------------------------------------------------------------------------------------------+
| convert(cast(convert(substring(convert(cast(convert('éééé' using  latin1) as binary) using utf8), 1, 3) using utf8) as binary) using latin1)     |
+--------------------------------------------------------------------------------------------------------------------------------------------------+
| ééé                                                                                                                                              |
+--------------------------------------------------------------------------------------------------------------------------------------------------+

My question is, which is the right configuration to make in order to have SUBSTRING() working in latin1?

Note

Here is the configuration before and after set names:

MySQL [hozana]> SELECT VERSION();
+-----------+
| VERSION() |
+-----------+
| 5.5.54    |
+-----------+

MySQL [hozana]> set names utf8;
Query OK, 0 rows affected (0.00 sec)

MySQL [hozana]> SHOW SESSION VARIABLES LIKE 'character\_set\_%';
+--------------------------+--------+
| Variable_name            | Value  |
+--------------------------+--------+
| character_set_client     | utf8   |
| character_set_connection | utf8   |
| character_set_database   | latin1 |
| character_set_filesystem | binary |
| character_set_results    | utf8   |
| character_set_server     | latin1 |
| character_set_system     | utf8   |
+--------------------------+--------+

MySQL [hozana]> set names latin1;
Query OK, 0 rows affected (0.00 sec)

MySQL [hozana]> SHOW SESSION VARIABLES LIKE 'character\_set\_%';
+--------------------------+--------+
| Variable_name            | Value  |
+--------------------------+--------+
| character_set_client     | latin1 |
| character_set_connection | latin1 |
| character_set_database   | latin1 |
| character_set_filesystem | binary |
| character_set_results    | latin1 |
| character_set_server     | latin1 |
| character_set_system     | utf8   |
+--------------------------+--------+
etienne callies
  • 23
  • 1
  • 1
  • 5

1 Answers1

1

User error.

When you say SET NAMES latin1, you are announcing to MySQL that the bytes coming from the client (you) are encoded in latin1. But they weren't. They were still in utf8.

When you typed ééééé, the bytes generate were these 10 bytes C3A9C3A9C3A9C3A9C3A9 Those were sent to mysql as 10 latin1 characters, namely ééééé. SUBSTRING, as requested, carved off the first 3 characters (but they were latin1 characters: éÃ, hex C3A9C3 and delivered them back to your UTF-8 client, which proceeded to interpret C3A9 as é, then gagged on the invalid UTF-8, hex C3, and puked on your terminal with its black diamond (the "REPLACEMENT CHARACTER").

So, always be sure to establish the encoding of the client, either via something in the connection mechanism or with SET NAMES. All sorts of nasties can occur if you specify it incorrectly. Alas, this does not address your problem directly; but it addresses a lot of other things that can happen.

Oh, another thing. You say you have "a MySQL database with latin1 encoding". That is OK. You must still specify the client to be encoded in (apparently) utf8 or utf8mb4. MySQL will convert to the encoding of the column when you do an INSERT, and convert back the other way when you do a SELECT. Since é exists in latin1 as well as utf8, (and ditto for all other Western European accented letters), all should be well.

Perhaps you crafted the Question with a literal. Well, that does not necessarily reflect SELECTing from a table. So, I crafted a table with both a latin1 column and a utf8 column, each containing ééééé, verified that the HEX and LENGTH were different. Then testing SELECT SUBSTRING(col, 1, 3) correctly produced ééé in both cases.

Rick James
  • 135,179
  • 13
  • 127
  • 222
  • Many many thanks for your accurate reply, nobody really explained me the encoding staff with mysql, and you did it very well! I know what's wrong with my code now. Cheers – etienne callies Dec 17 '18 at 17:38
  • @etiennecallies - Thanks. If it is any consolation, it took me years to understand the _many_ things that can go wrong with MySQL charsets, and be able to explain it. Furthermore, your case was something that I have never seen before. I had to run some tests to figure out how to reproduce it, then adapt my understanding to explain it. – Rick James Dec 17 '18 at 17:56