UTF-8 Character in Java becomes Invalid Character in MySQL

Question

I have a value that arrives from an external REST/JSON data source and contains a special character. When I convert it with a pre-existing utility, CharDecoder.java, it stays the same, but after I insert it into a MySQL database (whose default charset is UTF-8) it turns from ć into ?.

The flow of my program is this:

External data source sends JSON --> CharDecoder (inside a WAR file in Tomcat 7, handles special chars) populates a row --> MySQL database.

The end result in the MySQL database is an invalid character.

Dev environment information:

I am using Java 1.7 and Maven 3.3.3. Inside my pom.xml's <properties> tag:

<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>

I am running Eclipse Oxygen on macOS. The project's properties view (select the project and press ⌘I, i.e. COMMAND+I) states that the text file encoding is UTF-8.

When I convert the value using a utility class that is in the codebase, it works, but when I update the row in a MySQL database (whose table default charset is UTF-8) it becomes an invalid character.

So, I added this character to my chars array: "ć" (it's located in the same row that starts with "î").

import java.util.StringTokenizer;

public class CharDecoder {
    
    public final static String chars [] = 
    {
        "ö", "ä", "ü", "Ö", "Ä", "Ü", "ß",
        "?", "\\", ",", ":", ";", "#", "+", "~", "!", "\"", "§", "$", "%",
        "&", "(", ")", "=", "<", ">", "{", "[", "]", "}", "/", "â", "ê",
        "î", "ô", "û", "Â", "Ê", "Î", "Ô", "Û", "á","ć", "é", "í", "ó", "ú",
        "Á", "É", "Í", "Ó", "Ú", "à", "è", "ì", "ò", "ó", "ù", "Á", "É", "Í",
        "Ó", "Ú", "°", "³", "²", "€", "|", "^", "`", "´", "'", " ", "@",
        "~", "*"
    };

    public final static String charsHtml[] = 
    { 
        "ö", "ä", "ü", "Ö", "Ä", "Ü",
        "ß", "?", "\\", ",", ":", ";", "#", "+", "&tilde;", "!", "\"",
        "&sect;", "$", "%", "&amp;", "(", ")", "=", "&lt;", "&gt;", "{",
        "[", "]", "}", "/", "&acirc;", "&ecirc;", "&icirc;", "&ocirc;",
        "&ucirc;", "&Acirc;", "&Ecirc;", "&Icirc;", "&Ocirc;", "&Ucirc;",
        "&aacute;", "&eacute;", "&iacute;", "&oacute;", "&uacute;",
        "&Aacute;", "&Eacute;", "&Iacute;", "&Oacute;", "&Uacute;",
        "&agrave;", "&egrave;", "&igrave;", "&ograve;", "&Ugrave;",
        "&Agrave;", "&Egrave;", "&Igrave;", "&Ograve;", "&Ugrave;",
        "&deg;", "&sup3;", "&sup2;", "&euro;", "|", "&circ;", "`",
        "&acute;", "'", " ", "@", "~", "*"
    };
    
    public final static String entities[] = { 
        "F6", "E4", "FC", "D6", "C4",
        "DC", "DF", "3F", "5C", "2C", "3A", "3B", "23", "2B", "7E", "21",
        "22", "A7", "24", "25", "26", "28", "29", "3D", "3C", "3E", "7B",
        "5B", "5D", "7D", "2F", "E2", "EA", "EE", "F4", "FB", "C2", "CA",
        "CE", "D4", "DB", "E1", "E9", "ED", "F3", "FA", "C1", "C9", "CD",
        "D3", "DA", "E0", "E8", "EC", "F2", "F9", "C1", "C9", "CD", "D3",
        "DA", "B0", "B3", "B2", "80", "7C", "5E", "60", "B4", "27", "20",
        "40", "98", "2A"
    };

    public static String inputToChar(String input) {
        return (inputTo(input, chars));
    }

    // Decodes %XX escapes in the input, mapping each known hex code to the
    // entry at the same index in the target array tc.
    public static String inputTo(String input, String[] tc) {
        StringBuilder sb = new StringBuilder();
        boolean entity = false; // true when the previous token was "%"
        input = input.replace('+', ' ');
        String tokens = tc == charsHtml ? "%<>" : "%";
        for (StringTokenizer st = new StringTokenizer(input, tokens, true); st.hasMoreTokens(); ) {
            String token = st.nextToken();
            if (entity) {
                boolean replaced = false;
                for (int i = 0; i < entities.length; i++) {
                    if (token.startsWith(entities[i])) {
                        sb.append(tc[i]);
                        sb.append(token.substring(2));
                        replaced = true;
                        break;
                    }
                }
                if (!replaced) {
                    sb.append(token);
                }
                entity = false;
            } else if (token.equals("%")) {
                entity = true;
                continue;
            } else if (token.equals("<")) {
                sb.append("&lt;");
            } else if (token.equals(">")) {
                sb.append("&gt;");
            } else {
                sb.append(token);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String person1 = CharDecoder.inputToChar("Lukić");
        System.out.println(person1);
    }
}

To make this question more straightforward, I removed the JDBC code (a simple JDBC update query) and just created a main() method. When I run this main() method the output is:

Lukić

This is fine and is what I want. However, when I update the row using Spring JDBC, the value in the MySQL database (whose table default charset is UTF-8) becomes:

Luki?

This definitely happens on the database side. Should I change the table's default charset to LATIN1?

Would I have to change the entire database's default charset to LATIN1? I am only throwing ideas out there...

Is there a way to fix this without changing the default charset (I don't want to corrupt any existing data)?
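
A quick way to see why LATIN1 is unlikely to help: ć (U+0107) has no code point in ISO-8859-1, and MySQL's latin1 (which is close to, but not identical to, ISO-8859-1) does not contain it either. The sketch below is only an illustration, not the actual driver code path; the class and method names are made up for this example. It round-trips the string through a charset the way a conversion step somewhere between Java and the server might.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class Latin1Check {

    // Encodes the text with the given charset and decodes it again.
    // Characters the charset cannot represent are replaced during encoding.
    static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        String name = "Luki\u0107"; // "Lukić", written with an escape
        System.out.println(roundTrip(name, StandardCharsets.ISO_8859_1));
        System.out.println(roundTrip(name, StandardCharsets.UTF_8));
    }
}
```

With ISO-8859-1 the round trip returns "Luki?", the same symptom as in the database, while UTF-8 preserves the original string. That suggests the character is being lost in a conversion to a single-byte charset somewhere on the way to the table, rather than in the table's UTF-8 storage itself.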

Where is your method `entityTo`? And you have a typo in the main definition: `public static main void` should be `public static void main` – Jorge Campos, Feb 17 '18 at 03:51
I tested your code using `CharDecoder.inputToChar("Lukić");` and it is working fine here. I suspect that your project setup is not configured as UTF-8 (if you are in Eclipse, select the class or project and hit CTRL+ENTER) – Jorge Campos, Feb 17 '18 at 03:53
@JorgeCampos - Thanks, it was inputToChar("Lukić"). It is set up as UTF-8 inside Maven. Sorry, I am using Eclipse on macOS, so CTRL+ENTER doesn't do anything for me. I'll edit and add the Maven config right now. – PacificNW_Lover, Feb 17 '18 at 03:56
OK, it may be an issue with your environment setup; take a look at this thread: https://stackoverflow.com/questions/4606570/os-x-terminal-utf-8-issues By the way, what is your final goal with those conversions? There are Normalization classes for it. Take a look here: https://docs.oracle.com/javase/tutorial/i18n/text/normalizerapi.html – Jorge Campos, Feb 17 '18 at 04:04
@JorgeCampos It's not a macOS issue; it's a MySQL issue, because MySQL is running on macOS and Linux and the same thing happens on Linux. – PacificNW_Lover, Feb 17 '18 at 04:21

score 1 · Accepted Answer · answered Feb 17 '18 at 06:44

useUnicode=yes&characterEncoding=UTF-8

Add these parameters to your JDBC database URL.
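
As a rough sketch of what that looks like in practice, the helper below appends the two parameters to an existing JDBC URL. The class name, method name, host, port and schema are placeholders for illustration and are not taken from the question.

```java
class JdbcUrlSketch {

    // Appends the Unicode-related parameters to a base JDBC URL,
    // using "?" or "&" depending on whether a query string already exists.
    static String withUtf8(String baseUrl) {
        String separator = baseUrl.contains("?") ? "&" : "?";
        return baseUrl + separator + "useUnicode=yes&characterEncoding=UTF-8";
    }

    public static void main(String[] args) {
        // Placeholder host, port and schema name.
        System.out.println(withUtf8("jdbc:mysql://localhost:3306/mydb"));
        // prints jdbc:mysql://localhost:3306/mydb?useUnicode=yes&characterEncoding=UTF-8
    }
}
```

The resulting string is what you would pass as the url of your DataSource or to DriverManager.getConnection.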

Muthu

`?zeroDateTimeBehavior=convertToNull&useUnicode=true&characterEncoding=UTF-8` - Thanks, your suggestion helped me find the solution. – PacificNW_Lover Feb 19 '18 at 20:57

score 0 · Answer 2 · answered Feb 17 '18 at 09:02

If you want to have FULL Unicode support, you might as well go the whole way:

character_set_server=utf8mb4

See: What is the difference between utf8mb4 and utf8 charsets in mysql?
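
For context, MySQL's `utf8` charset stores at most three bytes per character, while `utf8mb4` allows four. The small check below (class and method names are just for illustration) shows how many bytes a few characters need in UTF-8, which is what decides whether they fit into `utf8`.

```java
import java.nio.charset.StandardCharsets;

class Utf8Width {

    // Number of bytes the UTF-8 encoding of the text occupies.
    static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println(utf8Length("\u0107"));       // ć (U+0107): 2 bytes
        System.out.println(utf8Length("\u20AC"));       // € (U+20AC): 3 bytes
        System.out.println(utf8Length("\uD83D\uDE00")); // U+1F600 (an emoji): 4 bytes
    }
}
```

So ć already fits into plain `utf8`; `utf8mb4` only becomes necessary for characters outside the Basic Multilingual Plane, such as emoji.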

Mick

