1

We have an older MySQL database that only supports the UTF-8 charset. Is there a way in Java to detect whether a given string will be UTF-8 compatible?

Saqib Ali
  • 4
    there is no such thing as "Mysql utf-8". UTF-8 is a standard in and of itself. Either something is UTF-8/unicode aware, or it is not. – Marc B Feb 18 '14 at 21:53
  • 1
    Not only the database side and the java side should be UTF-8 capable (both are), but also the communication via the JDBC driver must be set in case of MySQL, see [here](http://stackoverflow.com/questions/13359683/how-to-use-useunicode-yes-characterencoding-utf-8-with-dbcp) – Joop Eggen Feb 19 '14 at 14:19
  • @JoopEggen. Thanks. We checked the JDBC driver and it is set up properly. The issue arises when we encounter a UTF8MB4 string, which older versions of MySQL can't handle. – Saqib Ali Feb 19 '14 at 15:15

3 Answers

2
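import java.nio.charset.StandardCharsets;

// True if any code point in s needs more than 3 bytes in UTF-8 (i.e. requires utf8mb4).
public static boolean isUTF8MB4(String s) {
    for (int i = 0; i < s.length(); ) {
        // Take the whole code point (1 or 2 chars) so a surrogate pair is encoded together.
        int charCount = Character.charCount(s.codePointAt(i));
        int bytes = s.substring(i, i + charCount).getBytes(StandardCharsets.UTF_8).length;
        if (bytes > 3) {
            return true;
        }
        i += charCount;
    }
    return false;
}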

The above implementation seems best, but otherwise:

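public static boolean isUTF8MB4(String s) {
    for (int i = 0; i < s.length(); ) {
        int codePoint = s.codePointAt(i);
        // Code points above U+FFFF (supplementary) take 4 bytes in UTF-8.
        if (Character.isSupplementaryCodePoint(codePoint)) {
            return true;
        }
        // charCount is the number of Java chars (1 or 2), not the UTF-8 byte count.
        i += Character.charCount(codePoint);
    }
    return false;
}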

which might fail more often.

Joop Eggen
  • Thanks Joop. But we already have the code to detect UTF8MB4: http://stackoverflow.com/a/21478020/420558 . What we are looking for is a tell if a string is UTF8 compatible or not. – Saqib Ali Feb 19 '14 at 18:26
  • Java String is Unicode, and has no UTF-8 problems. I was under the impression that old MySQL could handle UTF-8 as long as it encoded a Unicode code point (character) to no more than 3 bytes. The [wikipedia](http://en.wikipedia.org/wiki/UTF-8) says code points starting with U+10000 (java 0x10000) are problematic. – Joop Eggen Feb 19 '14 at 18:39
  • Hello Joop. Sounds like we need to rephrase the question. Our old MySQL database doesn't support all character sets (for example UTF8MB4, but others as well). So, in Java, before we try to insert a string that will cause MySQL to throw an exception, we want to determine whether the string is UTF8 compatible. If it is not, we handle the situation before doing the insert. So, how do we determine whether the string is UTF8 compatible? – Saqib Ali Feb 19 '14 at 19:49
  • So "UTF8" is _in this context_ the name of the old MySQL UTF-8 implementation. Which is the latest standard UTF-8 minus the UTF8MB4 specific part (multi-byte sequences more than 3 bytes per Unicode character). – Joop Eggen Feb 20 '14 at 10:19
0

Every String is UTF-8 compatible. Just set encoding in the database and the MySQL driver correctly and you're set.

The only gotcha is that the length in bytes of the UTF-8 encoded string may be larger than what .length() reports. Here's a Java implementation of a function to measure how many bytes a string will take after encoding to UTF-8.
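A minimal sketch of such a byte-count function might look like the following. The class and method names (`Utf8Length`, `utf8Length`, `countUtf8Bytes`) are illustrative and not from the original answer, and the counting variant assumes the string is well formed (no unpaired surrogates).

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; the names are assumptions for illustration.
final class Utf8Length {

    // Simplest approach: encode the string and measure the resulting array.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Same count without allocating the byte array.
    // Assumes well-formed input: an unpaired surrogate is counted as 3 bytes here,
    // whereas getBytes would replace it with a single '?' byte.
    static int countUtf8Bytes(String s) {
        int bytes = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (cp < 0x80) {
                bytes += 1;          // ASCII
            } else if (cp < 0x800) {
                bytes += 2;
            } else if (cp < 0x10000) {
                bytes += 3;          // rest of the BMP
            } else {
                bytes += 4;          // supplementary planes (needs utf8mb4 in MySQL)
            }
            i += Character.charCount(cp);
        }
        return bytes;
    }
}
```

This matters when sizing columns: a VARCHAR limit expressed in bytes can be exceeded even when .length() looks safe.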

EDIT: Since Saqib pointed out that older MySQL doesn't actually support UTF-8, but only its BMP subset, you can check if a string contains codepoints outside BMP with string.length()==string.codePointCount(0,string.length()) ("true" means "all codepoints are in BMP") and remove them with string.replaceAll("[^\u0000-\uffff]", "")
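The two expressions from the edit above can be wrapped as small helpers, sketched here under the assumption that dropping non-BMP characters is acceptable for your data. The names `BmpOnly`, `isBmpOnly`, and `stripNonBmp` are made up for illustration. Note that an unpaired surrogate counts as one code point, so it passes the check and is not stripped.

```java
// Hypothetical helper class; the names are assumptions for illustration.
final class BmpOnly {

    // True when every code point fits in one Java char, i.e. lies in the BMP.
    // A surrogate pair makes length() exceed the code point count.
    static boolean isBmpOnly(String s) {
        return s.length() == s.codePointCount(0, s.length());
    }

    // Removes every code point above U+FFFF. Java's regex engine matches by
    // code point, so a surrogate pair is removed as a single unit.
    static String stripNonBmp(String s) {
        return s.replaceAll("[^\\u0000-\\uFFFF]", "");
    }
}
```

A caller could check `isBmpOnly(value)` before the insert and then either reject the value or store `stripNonBmp(value)`, depending on what the application can tolerate.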

Community
Karol S
  • I don't think that is correct. You can't store a UTF8MB4 string in the older MySQL DBs that only supported UTF8. – Saqib Ali Feb 19 '14 at 15:09
  • `Every String is UTF-8 compatible` <- not correct,`libidn2` has the function `u8_check` to check if a binary string is valid UTF-8 or not – James Stevens Aug 21 '20 at 15:24
0

MySQL defines:

The character set named utf8 uses a maximum of three bytes per character and contains only BMP characters.

Therefore this function should work:

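// MySQL's utf8 holds only BMP characters, so reject any string containing a surrogate.
private boolean isValidUTF8(final String string) {
    for (int i = 0; i < string.length(); i++) {
        final char c = string.charAt(i);
        // A single char always passes Character.isBmpCodePoint; in a Java String,
        // non-BMP characters appear as surrogate chars, so test for those instead.
        if (Character.isSurrogate(c)) {
            return false;
        }
    }
    return true;
}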
Kuu