String with encoding from MYSQL DB Query comes through as mangled in Java

Question

Using Java, I am connecting to a MYSQL database where I query a table to glean a field from it.

The field has UTF8 encoded text, that is:

Córas Éireann

The above text is what I see when I log in to the MYSQL console and look at that row in the table. The mysql version deployed is: mysql Ver 14.14 Distrib 5.1.52, for unknown-linux-gnu (x86_64) using readline 5.1

If I use a python program to connect to the same database and query the same table and get the same row, the text looks correct, i.e. it comes as Córas Éireann

However, when I query it through Java, the text comes out as:

CÃ³ras Ã‰ireann

which I suspect is Western (ISO latin1), but I am not sure, just a guess.

I did a show table status and saw that the table I am querying has Collation of utf8_general_ci

The table I am querying does not have any adddtionaly encoding definitions, the field, name, is a varchar(512).

+--------------------+--------------+------+-----+---------+----------------+
| Field              | Type         | Null | Key | Default | Extra          |
+--------------------+--------------+------+-----+---------+----------------+
| id                 | int(11)      | NO   | PRI | NULL    | auto_increment |
| name               | varchar(512) | YES  |     | NULL    |                |

The SQL connect JAR I imported is mysql-connector-java-5.1.36, I also tried v 5.1.34 and 5.0.8 but that made no difference.

This is how I connect to the DB:

    String dbStr =
    String.format("jdbc:mysql://%s:%d/%s?useUnicode=yes&characterEncoding=UTF-8", LOCAL_MYSQL_HOST,
                                       LOCAL_MYSQL_PORT, LOCAL_MYSQL_DB);
    try {
        cdb = DriverManager.getConnection(dbStr, LOCAL_MYSQL_USER, LOCAL_MYSQL_PASS);
        Statement dbStatement = cdb.createStatement();
        String query = String.format("SELECT name FROM customer WHERE id=%d",customerId);
        ResultSet row = dbStatement.executeQuery(query);
        if (row.first()) {
             System.out.println("name is " + row.getString("name");
        }
    } catch (SQLException exc) {
        exc.printStackTrace();
    }

Please note, my original implementation did not contain

?useUnicode=yes&characterEncoding=UTF-8

but adding it did not make it better or worse. I added it as I thought this might be the culprit. I also tried latin1 instead of utf-8 to see if this made a difference but no luck, the result is exactly the same, i.e. it comes out as CÃ³ras Ã‰ireann.

I even tried things like:

                byte[] rowBytes = row.getBytes("name");
                String utfdecocedStr = new String(rowBytes, "UTF-8");
                System.out.println(utfdecocedStr);

but the output remains as CÃ³ras Ã‰ireann

In python, I do not do any decoding/encoding, the query and connection are all basic, and I get the correct string. Is there a step I am missing that needs to be done on the DB or Java side to get this to work? In my.conf, I do not have any settings to setup any encoding config.

Python approach:

  import MySQLdb
  cdb = MySQLdb.connect(host=LOCAL_MYSQL_HOST,port=LOCAL_MYSQL_PORT,
                         user=LOCAL_MYSQL_USER,
                         passwd=LOCAL_MYSQL_PASS,db=LOCAL_MYSQL_DB)
  ccursor = self.cdb.cursor()
  query = """SELECT name FROM customer WHERE id='%d' """%(customer_id)
  row = ccursor.execute(query)
  if row:
      customername = ccursor.fetchone()

Thank you...Amro

---------Update 20150811---------------

I ran the following commands on the database and I found the following configuration which probably explains the behavior I am seeing:

show variables like 'character%';
+--------------------------+---------------------------------------------+
| Variable_name            | Value                                       |
+--------------------------+---------------------------------------------+
| character_set_client     | utf8                                        |
| character_set_connection | utf8                                        | 
| character_set_database   | latin1                                      |
| character_set_filesystem | binary                                      |
| character_set_results    | utf8                                        |
| character_set_server     | latin1                                      |
| character_set_system     | utf8                                        |
| character_sets_dir       | /usr/share/percona-xtradb-cluster/charsets/ |
+--------------------------+---------------------------------------------+
8 rows in set (0.00 sec)

mysql> show variables like 'collation%';
+----------------------+-------------------+
| Variable_name        | Value             |
+----------------------+-------------------+
| collation_connection | utf8_general_ci   |
| collation_database   | latin1_swedish_ci |
| collation_server     | latin1_swedish_ci |
+----------------------+-------------------+
3 rows in set (0.00 sec)

By adding the following to my.cnf:

character-set-server = utf8
character-set-filesystem = utf8

Then it the tables evolve to:

mysql> show variables like 'character%';

+--------------------------+---------------------------------------------+
| Variable_name            | Value                                       |
+--------------------------+---------------------------------------------+
| character_set_client     | utf8                                        |
| character_set_connection | utf8                                        |
| character_set_database   | latin1                                      |
| character_set_filesystem | utf8                                        |
| character_set_results    | utf8                                        |
| character_set_server     | utf8                                        |
| character_set_system     | utf8                                        |
| character_sets_dir       | /usr/share/percona-xtradb-cluster/charsets/ |
+--------------------------+---------------------------------------------+


+----------------------+-------------------+
| Variable_name        | Value             |
+----------------------+-------------------+
| collation_connection | utf8_general_ci   |
| collation_database   | latin1_swedish_ci |
| collation_server     | utf8_general_ci   |
+----------------------+-------------------+

So I issued the command on the DB console:

 ALTER DATABASE <dbname> CHARACTER SET utf8;

show variables like 'character%';

+--------------------------+---------------------------------------------+
| Variable_name            | Value                                       |
+--------------------------+---------------------------------------------+
| character_set_client     | utf8                                        |
| character_set_connection | utf8                                        |
| character_set_database   | utf8                                        |
| character_set_filesystem | utf8                                        |
| character_set_results    | utf8                                        |
| character_set_server     | utf8                                        |
| character_set_system     | utf8                                        |
| character_sets_dir       | /usr/share/percona-xtradb-cluster/charsets/ |
+--------------------------+---------------------------------------------+
8 rows in set (0.00 sec)

mysql> show variables like 'collation%';
+----------------------+-----------------+
| Variable_name        | Value           |
+----------------------+-----------------+
| collation_connection | utf8_general_ci |
| collation_database   | utf8_general_ci |
| collation_server     | utf8_general_ci |
+----------------------+-----------------+

Unfortunately this did not resolve the issue still.

Can someone let me know how I can correct this hopefully without purging the database?

Firstly, you need to be sure that while storing the field, you are using the same UTF-8 format. Is data corrupt in DB itself ? You may try with `ISO-8859-1` in `new String(rowBytes, "ISO-8859-1")`. But where are you getting this as a output `CÃ³ras Ã‰ireann` on your server console or on your IDE console ? — Utkarsh, Aug 11 '15 at 05:10
How are you viewing the output in java? If this is through eclipse, could you try changing the encoding of the eclipse console at Run -> Run Configurations -> Common — rest_day, Aug 11 '15 at 05:11
@UtKarash, the data in the DB is fine, as I pointed out if I view the data by connecting to the mysql console or python, the encoding shows up properly. So the DB data is not corrupt. I am using IntelliJ as my IDE, and its set to UTF-8. The output CÃ³ras Ã‰ireann is dumped on my server console (terminal window) where I run my java program from and my python program from. — Amro Younes, Aug 11 '15 at 05:27
You seems to be doing all the right things. May be check this : http://stackoverflow.com/questions/5405236/jdbc-mysql-utf-8-string-writing-problem In above link `my.cnf` is for linux and `my.ini` for windows. Adding the character set `default-character-set=utf8` in these files resolves the issue sometimes. — Utkarsh, Aug 11 '15 at 05:48
@UTKarash, thanks I will give that a try and let you know. I have to see if I need to restart the DB to force a re-read of my.conf. The DB is deployed on a linux machine. — Amro Younes, Aug 11 '15 at 05:58

Amro Younes · Accepted Answer · 2015-08-19T20:23:17.067

First off I realized that the table I was using has encoding of UTF-8, but the source table feeding into it is latin1 which confirmed my suspicion of what I thought the encoding was.

The person who wrote the code to copy the data from one to another did no encoding translation, so I feel by that point the data had corrupted encoding.

I performed various experiments that included connecting to database with

?useUnicode=yes&characterEncoding=UTF-8

I also played with the following on the SQL console:

SET character_set_client=latin1;
SET character_set_connection=latin1;
SET character_set_database=latin1;
SET character_set_results=latin1;

as well as variations of playing with my.cnf and setting:

[mysqld] 
character-set-server = utf8 
character-set-filesystem = utf8

Anyway, none of this helped. So finally to prove that it is not a java issue but its a corrupted data set, I created my own table with UTF8 encoding, stored the names in there and had my program pull the data. It looked correct. So now its a matter of fixing the original table contents.

The only thing I could not explain is how the python program did not flag this as an issue. This is not the first time where I encountered the forgiveness of python over Java's strong typing which is a blessing and a pain in these cases.

==============================

Update 2015/08/19:

When I fixed the source table to UTF8 and stored the data properly, the Java code worked but the python code broke.

To fix it in python I merely added the

self.cdb = MySQLdb.connect(host=LOCAL_MYSQL_HOST,port=LOCAL_MYSQL_PORT,
                         user=LOCAL_MYSQL_USER, 
  passwd=LOCAL_MYSQL_PASS,
  db=LOCAL_MYSQL_DB,use_unicode=True,charset="UTF8")

This resolved it.

Unfortunately, when I tried to work around this earlier with the Java driver it did not work:

dbStr = String.format("jdbc:mysql://%s:%d/%s?useUnicode=yes&characterEncoding=latin1", esConfig.LOCAL_MYSQL_HOST, esConfig.LOCAL_MYSQL_PORT, esConfig.LOCAL_MYSQL_DB);

I am using java 1.7.80 and the jdbc package is mysql-connector-java-5.1.36-bin.jar, the latest I found from Oracle

score 1 · Answer 2 · answered Aug 28 '15 at 04:42

1

You have Mojibake.

The bytes you have in the client are correctly encoded in utf8 (good).
You connected with SET NAMES latin1 (or set_charset('latin1') or ...), probably by default. (It should have been utf8.)
xx The column in the table was declared CHARACTER SET latin1. (Or possibly it was inherited from the table/database.) (It should have been utf8.)
The column in the tables may or may not have been CHARACTER SET utf8, but it should have been that.

If you need to fix for the data it takes a "2-step ALTER", something like

ALTER TABLE Tbl MODIFY COLUMN col VARBINARY(...) ...;
ALTER TABLE Tbl MODIFY COLUMN col VARCHAR(...) ... CHARACTER SET utf8 ...;

answered Aug 28 '15 at 04:42

Rick James

135,179
13
127
222

Thank you for the feedback, I have tried to modify the destination table before but I saw no difference on the Java side. My concern was that the source table was in latin1 and the destination table we copied into is UTF-8. Making both tables UTF-8 yielded the consistent results I wanted. – Amro Younes Aug 28 '15 at 18:21
How did you "copy"? One way works will translate latin1 to utf8; another way mangles the bytes. – Rick James Aug 28 '15 at 23:38
Unfotunately I don't have that information on how the copy happened. Another person used java and copied it without any translation. However as we all now, Java defaults to UTF-16 so I'm not sure if that contributed to the issue. It was not detected because using python or SQL console we never saw any corruption – Amro Younes Aug 31 '15 at 20:18
I think Java uses utf16 or ucs2 _internally_, but presents utf8 _externally_. `CÃ³ras Ã‰ireann` involves latin1, not utf16/ucs2. For further discussion, can you provide `SELECT col, HEX(col) FROM tbl WHERE ...` to see what was actually stored. – Rick James Aug 31 '15 at 23:48

String with encoding from MYSQL DB Query comes through as mangled in Java

2 Answers2