
A days-old problem: when returning a query from the MySQL database I get characters like ç instead of ç.

I'm using a really simple class to try to pin the problem down:

package com.dataTest;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class EncodingTest {

    public static void main(String[] args) throws Exception {
        // Load the (legacy) MySQL Connector/J driver
        Class.forName("com.mysql.jdbc.Driver");

        // Explicitly request a Unicode/UTF-8 connection
        Connection connection = DriverManager
                .getConnection("jdbc:mysql://internalip:3306/databasename?" + 
                               "user=user" +
                               "&password=password" + 
                               "&characterEncoding=utf-8" + 
                               "&useUnicode=yes");

        PreparedStatement stmt =
                connection.prepareStatement("SELECT * FROM `databasename`.teste_json;");
        ResultSet rs = stmt.executeQuery();

        while (rs.next()) {
            System.out.println(rs.getString("info"));
        }

        rs.close();
        stmt.close();
        connection.close();
    }
}

The information was added to the database by another class, but I also changed it manually using Workbench; the problem remains.

The database collation is utf8_general_ci and the table encoding is utf8.

I really don't know what to do.

EDIT:

I copied the JSON string into a varchar(1500) field and it prints perfectly. I forgot to mention: the field mentioned in the query (info) is a JSON field.

EDIT2: It's not mojibake?

The question was closed based on Rick James's answer claiming it was mojibake.

Going through the checklist from the question Trouble with utf8 characters; what I see is not what I stored, here are my observations:

  • The bytes to be stored need to be UTF-8-encoded. Fix this.

I think they are stored correctly. I inserted a simple JSON value and checked HEX, LENGTH and CHAR_LENGTH:

info:        [{"é": ""}]
HEX:         5B7B22C3A9223A2022227D5D
LENGTH:      12
CHAR_LENGTH: 11
  • The connection when INSERTing and SELECTing text needs to specify utf8 or utf8mb4. Fix this.

The connection string specifies the charset:

&useUnicode=yes&characterEncoding=UTF-8
  • The column needs to be declared CHARACTER SET utf8 (or utf8mb4). Fix this.

Well:
  • the database has charset/collation utf8/utf8_general_ci;
  • the table has collation utf8_unicode_ci;
  • the JSON column (the problematic one) has no collation.

How odd.
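The byte-level check from the first bullet can be reproduced in plain Java (a sketch; the string literal mirrors the value shown above, and the hex is built the same way MySQL's HEX() would build it):

```java
import java.nio.charset.StandardCharsets;

public class Utf8ByteCheck {
    public static void main(String[] args) {
        // Same JSON value as stored in the table
        String json = "[{\"é\": \"\"}]";

        byte[] utf8 = json.getBytes(StandardCharsets.UTF_8);

        // Build the uppercase hex string, one pair per byte, like MySQL's HEX()
        StringBuilder hex = new StringBuilder();
        for (byte b : utf8) {
            hex.append(String.format("%02X", b));
        }

        System.out.println(hex);            // 5B7B22C3A9223A2022227D5D
        System.out.println(utf8.length);    // 12 (LENGTH): é takes two bytes in UTF-8
        System.out.println(json.length());  // 11 (CHAR_LENGTH)
    }
}
```

If the output matches the values returned by the SELECT, the bytes in the table really are well-formed UTF-8.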

The manual states

MySQL handles strings used in JSON context using the utf8mb4 character set and utf8mb4_bin collation. Strings in other character sets are converted to utf8mb4 as necessary. (For strings in the ascii or utf8 character sets, no conversion is needed because ascii and utf8 are subsets of utf8mb4.)

From https://dev.mysql.com/doc/refman/5.7/en/json.html

Doesn't this mean that the actual collation is already correct?

From what I understand, the information stored on the server is correct (see the HEX output), so the problem is in the SELECT? But the connection string seems correct in every way.

EDIT: NOT A DUPLICATE.

The issues in this case are not the same as in the other question; see my own answer to this question and the bug report it mentions.

Johnny Bigoode

3 Answers


TL;DR: use CONVERT(... USING ...) with one of utf8mb4, utf16 or utf32.

As commented by user pvg, MySQL seems to store its JSON values in a utf8-family character set (utf8mb4, utf8, ascii); at least that's what the documentation says. According to this bug report (https://bugs.mysql.com/bug.php?id=81677), JSON values are stored in utf8mb4, but that encoding is not supported by the current JDBC driver (as of April 2017, that's the artifact mysql-connector-java, version 6.0.6).

Interestingly enough, MySQL has a CONVERT() function that can be used like this: CONVERT(info USING utf8).

Changing my query to SELECT field1, field2, CONVERT(info USING utf8) AS info FROM databasename.teste_json; completely fixed my issue.

Here are some results with the different encoding types in MySql.

I changed the field name for something more complex just to check what would happen. The string I used was ãõêçé日本語のキーボード

In the Java program I removed all encoding references from the connection string; these are the results of CONVERT with each of the character sets listed in the MySQL documentation (https://dev.mysql.com/doc/refman/5.5/en/charset-charsets.html):

big5     {"name": "??????日本語のキ?ボ?ド"}
dec8     {"name": "?ãõêçé?????????"}
cp850    {"name": "?ãõêçé?????????"}
hp8      {"name": "?âêÁµÅ?????????"}
koi8r    {"name": "???????????????"}
latin1   {"name": "?ãõêçé?????????"}
latin2   {"name": "????çé?????????"}
swe7     {"name": "?????`?????????"}
ascii    {"name": "???????????????"}
ujis     {"name": "?ãõêçé日本語のキーボード"}
sjis     {"name": "??????日本語のキーボード"}
hebrew   {"name": "???????????????"}
tis620   {"name": "???????????????"}
euckr    {"name": "??????日本語のキ?ボ?ド"}
koi8u    {"name": "???????????????"}
gb2312   {"name": "???ê?é日本?のキ?ボ?ド"}
greek    {"name": "???????????????"}
cp1250   {"name": "????çé?????????"}
gbk      {"name": "???ê?é日本語のキーボード"}
latin5   {"name": "?ãõêçé?????????"}
armscii8 {"name": "???????????????"}
utf8     {"name": "?ãõêçé日本語のキーボード"}
ucs2     {"name": "?ãõêçé日本語のキーボード"}
cp866    {"name": "???????????????"}
keybcs2  {"name": "?????é?????????"}
macce    {"name": "??õ??é?????????"}
macroman {"name": "?ãõêçé?????????"}
cp852    {"name": "????çé?????????"}
latin7   {"name": "??õ??é?????????"}
utf8mb4  {"name": "ãõêçé日本語のキーボード"}
cp1251   {"name": "???????????????"}
utf16    {"name": "ãõêçé日本語のキーボード"}
cp1256   {"name": "???êçé?????????"}
cp1257   {"name": "??õ??é?????????"}
utf32    {"name": "ãõêçé日本語のキーボード"}
binary   {"name": "ðÃ£ÃµÃªÃ§Ã©æ¥æ¬èªã®ã­ã¼ãã¼ã"}
geostd8  {"name": "???????????????"}
cp932    {"name": "??????日本語のキーボード"}
eucjpms  {"name": "?ãõêçé日本語のキーボード"}

Please note that the only character sets that managed to show all characters correctly, including the smiley face (a 4-byte supplementary character), were utf8mb4, utf16 and utf32.
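This pattern makes sense once you remember that MySQL's utf8 is capped at 3 bytes per character, so only character sets that can carry 4-byte code points survive the round trip. A small Java sketch of the byte widths involved (U+1F600 stands in here for the smiley used in the test string):

```java
import java.nio.charset.StandardCharsets;

public class ByteWidths {
    public static void main(String[] args) {
        // BMP characters such as ç fit within MySQL's 3-byte utf8...
        System.out.println("ç".getBytes(StandardCharsets.UTF_8).length);  // 2

        // ...but a supplementary-plane character like U+1F600 needs 4 bytes of
        // UTF-8, which in MySQL only utf8mb4 (or utf16/utf32) can represent
        String smiley = new String(Character.toChars(0x1F600));
        System.out.println(smiley.getBytes(StandardCharsets.UTF_8).length);  // 4
    }
}
```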

EDIT: From all the discussion in the other post, this seems to be some kind of error in the JDBC driver; at least everything points to it. The workaround that currently works is to force CONVERT(... USING utf8mb4). I strongly recommend reading through Rick James's material on the subject if you wish to understand more about encoding problems.

Johnny Bigoode
  • I'm curious what the advantage of using the JSON type is. It seems like you're guaranteed to perform _two_ conversions per JSON document reaching your Java client. What do you get from JSON that VARCHAR doesn't give you? – pvg Apr 11 '17 at 18:38
  • In this case the only advantage is that there's JSON validation at database level. In future cases you get access to some JSON functions https://dev.mysql.com/doc/refman/5.7/en/json-functions.html . I do think that I should've weighed my options before using JSON, but it has been a really interesting experiment and I'm learning a lot about it. – Johnny Bigoode Apr 11 '17 at 21:00
  • You know... you're right! I kinda dismissed his reply because I was 100% absolutely really sure there was some sort of problem in the JDBC connection or Java code. Gonna take a long closer look into that answer and see if it solves my problem. – Johnny Bigoode Apr 11 '17 at 21:21
  • Yes the answer is very unfortunately phrased. This is not 'Mojibake'. This is 'mysql is broken in a non-obvious, poorly documented way'. The key facts are - mysql utf8 is not utf8. And the made-up mysql encoding name utf8mb4 is actually utf8. This should be in flashing red letters somewhere but oddly, it is not :) – pvg Apr 11 '17 at 21:55
  • Well, to be honest I was retracing all steps and checking Rick James guides on the matter, utf8mb4 is not actually utf8 because it seems that Java doesn't like it. At least that's the conclusion I'm getting. I'm just wondering if I can re-open this question, since it's now closed as a duplicate... really hoped for a better suggestion. – Johnny Bigoode Apr 12 '17 at 12:35
  • What do you get when you leave out the call to CONVERT? – Rick James Apr 12 '17 at 14:22
  • `JSON` vs `VARCHAR`: In newer Oracle MySQL versions, there are a lot of JSON functions for dissecting the string. Otherwise, quite similar. – Rick James Apr 12 '17 at 14:25
  • If I leave out the call to CONVERT I'll get the same issue. – Johnny Bigoode Apr 12 '17 at 15:13

That's "Mojibake". The problem and solution are discussed in

Trouble with UTF-8 characters; what I see is not what I stored

Java: Note that it is spelled UTF-8, not utf8, nor utf8mb4:

  • JDBC: add useUnicode=true&characterEncoding=UTF-8 to the connection URL
  • JSP: <%@ page language="java" pageEncoding="UTF-8"%> and <%@ page contentType="text/html;charset=UTF-8" %>
  • Gradle: compileJava.options.encoding = 'UTF-8'
  • HTML: <form method="post" action="/your/url/" accept-charset="UTF-8">
Rick James
  • This seems more like the question is a dupe for your existing answer rather than a new answer. – pvg Apr 11 '17 at 20:50
  • Really don't think it's Mojibake, see updated answer. – Johnny Bigoode Apr 12 '17 at 12:25
  • No resolution. And yes, I'm using the `characterEncoding=UTF-8` parameter – Johnny Bigoode Apr 12 '17 at 15:12
  • What action produces `ç`? Until this is cleared up, I claim it is Mojibake. – Rick James Apr 12 '17 at 15:36
  • SQL query SELECT * FROM `databasename`.teste_json; should return {"": "ç"} when in fact it returns {"": "ç"} – Johnny Bigoode Apr 12 '17 at 17:34
  • Also, SELECT info, HEX(info), LENGTH(info), CHAR_LENGTH(info) FROM databasename.teste_json; returns {"": "ç"} 7B22223A2022C3A7227D 10 9, which seems to be correct, am I missing anything? – Johnny Bigoode Apr 12 '17 at 17:35
  • What version of what browser on what OS? – Rick James Apr 13 '17 at 04:45
  • Look in the Browser for the Text Encoding. Does it say UTF-8? – Rick James Apr 13 '17 at 04:52
  • I think you did everything "right", but the Browser (for whatever reason) is interpreting the page as "Western (Windows-1252)". – Rick James Apr 13 '17 at 04:54
  • There is no browser, as I stated in the original question, this is pure Java, the data I'm getting is from the Eclipse console. Could that be the issue? Can I set the encoding for System.out.println()? – Johnny Bigoode Apr 13 '17 at 14:43
  • Then let's dissect the Eclipse console -- I _think_ it is treating bytes as latin1 (cp1252) instead of utf8. The `SELECT HEX, etc` says that it is (1) correctly stored, and (2) correctly interpreted when fetching. So, I deduce that the rendering engine (Eclipse) is at fault. – Rick James Apr 13 '17 at 14:48
  • I changed System.out.println to a PrintStream with "UTF-8"; ç still prints as ç. I thought about changing the String encoding, but it seems that it's UTF-16 and can't be changed, so the problem can't be there. After reading the bug report on JSON again I decided to do the obvious and compare hexadecimal information. I was printing using DatatypeConverter.printHexBinary(r.getBytes()) and the hex DID point to something wrong; the string Java was getting is a little longer than it should be: 7B22223A2022C383C2A7227D vs 7B22223A2022C3A7227D. – Johnny Bigoode Apr 13 '17 at 16:10
  • Anyways, I'm 100% sure that there's a driver problem, so I'm gonna close this with my own reply stating that CONVERT USING is the only way out, but I really appreciate all the information and knowledge shared. – Johnny Bigoode Apr 13 '17 at 16:11
  • `C3 83 C2 A7` is "double encoding". Those 4 bytes, if treated as latin1, represent `ç`. If treated as utf8, they represent `ç`. In _latin1_ `ç` is hex `C3 A7`. That, in turn, when treated as utf8 is `ç`. To get double-encoding, the `INSERT` has to be screwed up. Previously you had no clue of that. – Rick James Apr 13 '17 at 21:44
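The double encoding described in that last comment is easy to reproduce and undo in plain Java (a sketch for illustration; the real fix belongs on the INSERT/driver side, not in re-decoding on read):

```java
import java.nio.charset.StandardCharsets;

public class DoubleEncoding {
    public static void main(String[] args) {
        String original = "ç";

        // Double encode: take the UTF-8 bytes of ç (C3 A7), misread them as
        // latin1 (yielding "ç"), which will then be re-encoded as UTF-8
        // downstream as C3 83 C2 A7
        String mojibake = new String(original.getBytes(StandardCharsets.UTF_8),
                                     StandardCharsets.ISO_8859_1);
        System.out.println(mojibake);  // ç

        // Undo it: turn the mojibake back into latin1 bytes, then decode as UTF-8
        String repaired = new String(mojibake.getBytes(StandardCharsets.ISO_8859_1),
                                     StandardCharsets.UTF_8);
        System.out.println(repaired);  // ç
    }
}
```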

Same issue here, changing the connector from mysql:mysql-connector-java:5.1.39 to mysql:mysql-connector-java:5.1.44 fixed it.

ToYonos