1

String with encoding from MYSQL DB Query comes through as mangled in Java

Using Java, I am connecting to a MYSQL database where I query a table to glean a field from it.

The field has UTF8 encoded text, that is:

Córas Éireann

The above text is what I see when I log in to the MYSQL console and look at that row in the table. The mysql version deployed is: mysql Ver 14.14 Distrib 5.1.52, for unknown-linux-gnu (x86_64) using readline 5.1

If I use a python program to connect to the same database and query the same table and get the same row, the text looks correct, i.e. it comes as Córas Éireann

However, when I query it through Java, the text comes out as:

Córas Éireann

which I suspect is Western (ISO latin1), but I am not sure, just a guess.

I did a show table status and saw that the table I am querying has Collation of utf8_general_ci

The table I am querying does not have any adddtionaly encoding definitions, the field, name, is a varchar(512).

+--------------------+--------------+------+-----+---------+----------------+
| Field              | Type         | Null | Key | Default | Extra          |
+--------------------+--------------+------+-----+---------+----------------+
| id                 | int(11)      | NO   | PRI | NULL    | auto_increment |
| name               | varchar(512) | YES  |     | NULL    |                |

The SQL connect JAR I imported is mysql-connector-java-5.1.36, I also tried v 5.1.34 and 5.0.8 but that made no difference.

This is how I connect to the DB:

    String dbStr =
    String.format("jdbc:mysql://%s:%d/%s?useUnicode=yes&characterEncoding=UTF-8", LOCAL_MYSQL_HOST,
                                       LOCAL_MYSQL_PORT, LOCAL_MYSQL_DB);
    try {
        cdb = DriverManager.getConnection(dbStr, LOCAL_MYSQL_USER, LOCAL_MYSQL_PASS);
        Statement dbStatement = cdb.createStatement();
        String query = String.format("SELECT name FROM customer WHERE id=%d",customerId);
        ResultSet row = dbStatement.executeQuery(query);
        if (row.first()) {
             System.out.println("name is " + row.getString("name");
        }
    } catch (SQLException exc) {
        exc.printStackTrace();
    }

Please note, my original implementation did not contain

?useUnicode=yes&characterEncoding=UTF-8

but adding it did not make it better or worse. I added it as I thought this might be the culprit. I also tried latin1 instead of utf-8 to see if this made a difference but no luck, the result is exactly the same, i.e. it comes out as Córas Éireann.

I even tried things like:

                byte[] rowBytes = row.getBytes("name");
                String utfdecocedStr = new String(rowBytes, "UTF-8");
                System.out.println(utfdecocedStr);

but the output remains as Córas Éireann

In python, I do not do any decoding/encoding, the query and connection are all basic, and I get the correct string. Is there a step I am missing that needs to be done on the DB or Java side to get this to work? In my.conf, I do not have any settings to setup any encoding config.

Python approach:

  import MySQLdb
  cdb = MySQLdb.connect(host=LOCAL_MYSQL_HOST,port=LOCAL_MYSQL_PORT,
                         user=LOCAL_MYSQL_USER,
                         passwd=LOCAL_MYSQL_PASS,db=LOCAL_MYSQL_DB)
  ccursor = self.cdb.cursor()
  query = """SELECT name FROM customer WHERE id='%d' """%(customer_id)
  row = ccursor.execute(query)
  if row:
      customername = ccursor.fetchone()

Thank you...Amro

---------Update 20150811---------------

I ran the following commands on the database and I found the following configuration which probably explains the behavior I am seeing:

show variables like 'character%';
+--------------------------+---------------------------------------------+
| Variable_name            | Value                                       |
+--------------------------+---------------------------------------------+
| character_set_client     | utf8                                        |
| character_set_connection | utf8                                        | 
| character_set_database   | latin1                                      |
| character_set_filesystem | binary                                      |
| character_set_results    | utf8                                        |
| character_set_server     | latin1                                      |
| character_set_system     | utf8                                        |
| character_sets_dir       | /usr/share/percona-xtradb-cluster/charsets/ |
+--------------------------+---------------------------------------------+
8 rows in set (0.00 sec)

mysql> show variables like 'collation%';
+----------------------+-------------------+
| Variable_name        | Value             |
+----------------------+-------------------+
| collation_connection | utf8_general_ci   |
| collation_database   | latin1_swedish_ci |
| collation_server     | latin1_swedish_ci |
+----------------------+-------------------+
3 rows in set (0.00 sec)

By adding the following to my.cnf:

character-set-server = utf8
character-set-filesystem = utf8

Then it the tables evolve to:

mysql> show variables like 'character%';

+--------------------------+---------------------------------------------+
| Variable_name            | Value                                       |
+--------------------------+---------------------------------------------+
| character_set_client     | utf8                                        |
| character_set_connection | utf8                                        |
| character_set_database   | latin1                                      |
| character_set_filesystem | utf8                                        |
| character_set_results    | utf8                                        |
| character_set_server     | utf8                                        |
| character_set_system     | utf8                                        |
| character_sets_dir       | /usr/share/percona-xtradb-cluster/charsets/ |
+--------------------------+---------------------------------------------+


+----------------------+-------------------+
| Variable_name        | Value             |
+----------------------+-------------------+
| collation_connection | utf8_general_ci   |
| collation_database   | latin1_swedish_ci |
| collation_server     | utf8_general_ci   |
+----------------------+-------------------+

So I issued the command on the DB console:

 ALTER DATABASE <dbname> CHARACTER SET utf8;

show variables like 'character%';

+--------------------------+---------------------------------------------+
| Variable_name            | Value                                       |
+--------------------------+---------------------------------------------+
| character_set_client     | utf8                                        |
| character_set_connection | utf8                                        |
| character_set_database   | utf8                                        |
| character_set_filesystem | utf8                                        |
| character_set_results    | utf8                                        |
| character_set_server     | utf8                                        |
| character_set_system     | utf8                                        |
| character_sets_dir       | /usr/share/percona-xtradb-cluster/charsets/ |
+--------------------------+---------------------------------------------+
8 rows in set (0.00 sec)

mysql> show variables like 'collation%';
+----------------------+-----------------+
| Variable_name        | Value           |
+----------------------+-----------------+
| collation_connection | utf8_general_ci |
| collation_database   | utf8_general_ci |
| collation_server     | utf8_general_ci |
+----------------------+-----------------+

Unfortunately this did not resolve the issue still.

Can someone let me know how I can correct this hopefully without purging the database?

Amro Younes
  • 1,261
  • 2
  • 16
  • 34
  • Firstly, you need to be sure that while storing the field, you are using the same UTF-8 format. Is data corrupt in DB itself ? You may try with `ISO-8859-1` in `new String(rowBytes, "ISO-8859-1")`. But where are you getting this as a output `Córas Éireann` on your server console or on your IDE console ? – Utkarsh Aug 11 '15 at 05:10
  • How are you viewing the output in java? If this is through eclipse, could you try changing the encoding of the eclipse console at Run -> Run Configurations -> Common – rest_day Aug 11 '15 at 05:11
  • @UtKarash, the data in the DB is fine, as I pointed out if I view the data by connecting to the mysql console or python, the encoding shows up properly. So the DB data is not corrupt. I am using IntelliJ as my IDE, and its set to UTF-8. The output Córas Éireann is dumped on my server console (terminal window) where I run my java program from and my python program from. – Amro Younes Aug 11 '15 at 05:27
  • You seems to be doing all the right things. May be check this : http://stackoverflow.com/questions/5405236/jdbc-mysql-utf-8-string-writing-problem In above link `my.cnf` is for linux and `my.ini` for windows. Adding the character set `default-character-set=utf8` in these files resolves the issue sometimes. – Utkarsh Aug 11 '15 at 05:48
  • @UTKarash, thanks I will give that a try and let you know. I have to see if I need to restart the DB to force a re-read of my.conf. The DB is deployed on a linux machine. – Amro Younes Aug 11 '15 at 05:58

2 Answers2

1

First off I realized that the table I was using has encoding of UTF-8, but the source table feeding into it is latin1 which confirmed my suspicion of what I thought the encoding was.

The person who wrote the code to copy the data from one to another did no encoding translation, so I feel by that point the data had corrupted encoding.

I performed various experiments that included connecting to database with

?useUnicode=yes&characterEncoding=UTF-8

I also played with the following on the SQL console:

SET character_set_client=latin1;
SET character_set_connection=latin1;
SET character_set_database=latin1;
SET character_set_results=latin1;

as well as variations of playing with my.cnf and setting:

[mysqld] 
character-set-server = utf8 
character-set-filesystem = utf8

Anyway, none of this helped. So finally to prove that it is not a java issue but its a corrupted data set, I created my own table with UTF8 encoding, stored the names in there and had my program pull the data. It looked correct. So now its a matter of fixing the original table contents.

The only thing I could not explain is how the python program did not flag this as an issue. This is not the first time where I encountered the forgiveness of python over Java's strong typing which is a blessing and a pain in these cases.

==============================

Update 2015/08/19:

When I fixed the source table to UTF8 and stored the data properly, the Java code worked but the python code broke.

To fix it in python I merely added the

self.cdb = MySQLdb.connect(host=LOCAL_MYSQL_HOST,port=LOCAL_MYSQL_PORT,
                         user=LOCAL_MYSQL_USER, 
  passwd=LOCAL_MYSQL_PASS,
  db=LOCAL_MYSQL_DB,use_unicode=True,charset="UTF8")

This resolved it.

Unfortunately, when I tried to work around this earlier with the Java driver it did not work:

dbStr = String.format("jdbc:mysql://%s:%d/%s?useUnicode=yes&characterEncoding=latin1", esConfig.LOCAL_MYSQL_HOST, esConfig.LOCAL_MYSQL_PORT, esConfig.LOCAL_MYSQL_DB);

I am using java 1.7.80 and the jdbc package is mysql-connector-java-5.1.36-bin.jar, the latest I found from Oracle

Amro Younes
  • 1,261
  • 2
  • 16
  • 34
1

You have Mojibake.

  • The bytes you have in the client are correctly encoded in utf8 (good).
  • You connected with SET NAMES latin1 (or set_charset('latin1') or ...), probably by default. (It should have been utf8.)
  • xx The column in the table was declared CHARACTER SET latin1. (Or possibly it was inherited from the table/database.) (It should have been utf8.)
  • The column in the tables may or may not have been CHARACTER SET utf8, but it should have been that.

If you need to fix for the data it takes a "2-step ALTER", something like

ALTER TABLE Tbl MODIFY COLUMN col VARBINARY(...) ...;
ALTER TABLE Tbl MODIFY COLUMN col VARCHAR(...) ... CHARACTER SET utf8 ...;
Rick James
  • 135,179
  • 13
  • 127
  • 222
  • Thank you for the feedback, I have tried to modify the destination table before but I saw no difference on the Java side. My concern was that the source table was in latin1 and the destination table we copied into is UTF-8. Making both tables UTF-8 yielded the consistent results I wanted. – Amro Younes Aug 28 '15 at 18:21
  • How did you "copy"? One way works will translate latin1 to utf8; another way mangles the bytes. – Rick James Aug 28 '15 at 23:38
  • Unfotunately I don't have that information on how the copy happened. Another person used java and copied it without any translation. However as we all now, Java defaults to UTF-16 so I'm not sure if that contributed to the issue. It was not detected because using python or SQL console we never saw any corruption – Amro Younes Aug 31 '15 at 20:18
  • I think Java uses utf16 or ucs2 _internally_, but presents utf8 _externally_. `Córas Éireann` involves latin1, not utf16/ucs2. For further discussion, can you provide `SELECT col, HEX(col) FROM tbl WHERE ...` to see what was actually stored. – Rick James Aug 31 '15 at 23:48