1

I'm trying to figure out something which seems odd to me.

A little background first.

We are using MySqlConnector version 8.0.23 and we noticed that the character encoding used is WINDOWS-1252

Our database is defined as the character set latin1.

We saw that MySQL server saves the data as a byteArray, also the MySQL connector function that reads string from MySQL server is as follows:

    public String createFromBytes(byte[] bytes, int offset, int length, Field f) {
        return StringUtils.toString(bytes, offset, length,
                f.getCollationIndex() == CharsetMapping.MYSQL_COLLATION_INDEX_binary ? this.pset.getStringProperty(PropertyKey.characterEncoding).getValue()
                        : f.getEncoding());
    }

Basically, read a byte array with some encoding and return a string for it. (here we saw it used WINDOWS-1252 encoding).

So we tried to think, how does the character_set defined in the MySQL server is relevant here?

If the connector is the one specifying the encoding and decoding and the server only holds byteArray it seems like the MySQL server configuration is not used.

So we thought maybe it's used in statements performed by MySQL itself (order by , join ... etc)

And we thought that was the case, however we saw something else that now doesn't make sense.

We had another issue where we got an emoji and got the following error message on insert.

java.sql.SQLException: Incorrect string value: '\xF0\x9F\x98\x89",...' for column...

MySQL Connector was on version 8.0.27 (there wasn't any character encoding defined so it should use UTF8 as far as I know)

and the column was defined as utf8 (alias to utf8mb3)

In addition, when we downgraded MySql connector to 8.0.23 (which used Windows-1252 character set)

we didn't get an error from the server - the emoji was saved as '?'.

what is going on here?

I saw this post How to store Emoji Character in MySQL Database

which says to define the connection and server to utf8mb4.

I would expect it also to work when configuring only the connection to utf8mb4.

As the emoji will be encoded properly to the byteArray The byteArray will be saved in the column The byteArray will be read as utf8MB4

Thank you

  • Collation is used to compare characters, that's why `select 'e'='é' collate utf8mb4_0900_ai_ci;` returns `true`. Character_sets are used to define the characters. When storing emoji's you should use `utf8mb4` (see the question you linked). (see: [DBFIDDLE](https://dbfiddle.uk/?rdbms=mysql_8.0&fiddle=a159ab20c497ebc2f09a1c3a1070c797)) – Luuk Mar 30 '22 at 10:02
  • I understand what needs to be done to make it work. what I don't understand that if the DB is saving the data as bytearray , why do we need to define the character set for it? The MySQL connector configurations should be enough for the read and write. Or in other words, why do we need to define the utf8mb4 on the MySQL server itself, if everything string he saves is using bytearray? Feels like I'm missing something – Shachar Levy Mar 30 '22 at 10:30
  • I do not know your code, and I would not think it 'smart' when the `createFromBytes` is used to read all strings from the database. – Luuk Mar 30 '22 at 11:37
  • Luuk, this is not my code - this is MySQL Connector code see here - https://github.com/mysql/mysql-connector-j/blob/release/8.0/src/main/core-impl/java/com/mysql/cj/result/StringValueFactory.java – Shachar Levy Mar 30 '22 at 11:53
  • One _must_ specify the encoding in the client by one mechanism and the encoding in the table by a different mechanism. There is no single setting that does both. Failure to set both will lead to garbage. – Rick James Mar 30 '22 at 16:00
  • If I look at a specific scenario. let's say I only use the DB to save information (without queries). From what I saw, the MySQL connector chooses the encoding to use for write and reading. MySQL saves the data as byte array - so in this case, it doesn't really matter what is defined in the MySQL server right? – Shachar Levy Mar 31 '22 at 06:24
  • Do `SHOW CREATE TABLE`. Are the columns declared to be `BINARY(...)` or `BLOB`? Those are the MySQL equivalent of "byte array". See this for many of the nasties you can get when not handling the `CHARACTER SET` consistently: https://stackoverflow.com/questions/38363566/trouble-with-utf8-characters-what-i-see-is-not-what-i-stored – Rick James Apr 01 '22 at 23:48

0 Answers0