How do I identify if the string contains a special character which cannot be stored using a utf8-mb4 character set

Question

Refer to this tweet and the following thread, where we are trying to store a similar tweet in the database. I am unable to store this tweet in MySQL. I would like to know how to identify whether a string contains a character that cannot be processed by the utf8mb4 character set, so that I can avoid storing it.

You misunderstood something: `utf8mb4` can store all Unicode characters currently supported. Reread the referred question. – axtavt, Jan 09 '12 at 09:44
Are you sure that the problem is with MySQL? Maybe it's with the MySQL driver or some such. – cha0site, Jan 11 '12 at 06:53
I've pretty much tried all options as far as MySQL is concerned. Since I am unable to store the tweet, I would like to know a way to find such strings so that I can avoid storing them. – priya, Jan 11 '12 at 08:28
What does MySQL say it is willing to store in such strings? Just 8 bit ASCII codes? In that case, the test is easy. If MySQL is willing to store Unicode, you shouldn't have a problem. If it stores something else... Unicode defines a wide variety of character classes, and some tools (we have one, but it isn't easily accessed from MySQL environments) implement the corresponding predicates, so it is possible to decide for any character code whether it belongs to such a Unicode class. – Ira Baxter, Jan 11 '12 at 12:09
Are your tables' default character set and text fields set to utf8mb4? – Zack Macomber, Jan 11 '12 at 13:56
@ZackMacomber - Yes, the default character set and text fields are set to utf8mb4. – priya, Jan 16 '12 at 11:07

score 4 · Answer 1 · answered Jan 11 '12 at 14:01

The character that poses a problem for you is U+1F603 SMILING FACE WITH OPEN MOUTH, which has a value not representable in 16 bits. When converted to UTF-8 the byte values are f0 9f 98 83, which should fit without issues in a utf8mb4 character set MySQL column, so I will agree with the other commenters that it doesn't look to be a MySQL issue. If you can attempt to re-insert this tweet, log all SQL statements as received by MySQL to determine if the characters get corrupted before or after sending them to MySQL.
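
A minimal sketch to check the byte values quoted above. It builds the character from its code point and prints its UTF-8 encoding. The class name `Utf8BytesDemo` and the helper `utf8Hex` are made up for illustration and are not part of any library.

```java
import java.nio.charset.StandardCharsets;

final class Utf8BytesDemo {
    /** Returns the UTF-8 encoding of a string as lowercase hex pairs separated by spaces. */
    static String utf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff)); // mask so the byte is treated as unsigned
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+1F603 lies outside the BMP, so Java holds it as a surrogate pair (two chars).
        String smiley = new String(Character.toChars(0x1F603));
        System.out.println(smiley.length()); // prints 2 (UTF-16 code units)
        System.out.println(utf8Hex(smiley)); // prints f0 9f 98 83
    }
}
```

If the bytes logged on the MySQL side differ from these four, the character was probably altered before it reached the server.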

answered Jan 11 '12 at 14:01

Tassos Bassoukos

I tried storing this tweet into MySQL using a utf8mb4 character set, but it seems to fail and I am unable to fix that issue. Hence I would like to check at the application level whether a string contains such a character, so that I can avoid storing such strings. – priya Jan 16 '12 at 11:13
We understand that this is your diagnostic, but we think it's wrong; therefore it would help if you could add more details to support or refute your root cause analysis. Do you get an error message? Can you post the generated SQL, as requested? – tripleee Jan 17 '12 at 07:45
priya, if you want to simply check the tweets, the way is easy - check to see if any character in the tweet has a UTF-8 representation larger than 3 bytes. However, as @tripleee mentions, we believe that MySQL probably is not at fault here. – Tassos Bassoukos Jan 18 '12 at 14:38
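
The check described in the comment above could be sketched as follows, assuming Java 8 or later for `String.codePoints()`. The class and method names are hypothetical. The byte counts follow the standard UTF-8 ranges: only code points from U+10000 upward need 4 bytes.

```java
final class FourByteUtf8Check {
    /** Number of bytes UTF-8 needs to encode one Unicode code point. */
    static int utf8Length(int codePoint) {
        if (codePoint < 0x80) return 1;    // ASCII
        if (codePoint < 0x800) return 2;
        if (codePoint < 0x10000) return 3; // rest of the Basic Multilingual Plane
        return 4;                          // supplementary planes
    }

    /** True if any code point in the string needs more than 3 bytes in UTF-8. */
    static boolean needsFourByteUtf8(String s) {
        return s.codePoints().anyMatch(cp -> utf8Length(cp) > 3);
    }

    public static void main(String[] args) {
        String smiley = new String(Character.toChars(0x1F603));
        System.out.println(needsFourByteUtf8("plain text"));      // prints false
        System.out.println(needsFourByteUtf8("tweet " + smiley)); // prints true
    }
}
```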

score 1 · Answer 2 · answered Jan 17 '12 at 06:01

Instead of looking for the special characters, you can convert the string's bytes to a hex string, store that, and convert it back to the original string when you read it.

public static String toHex(byte[] buf) {
    // Two hex digits per input byte
    StringBuilder sb = new StringBuilder(buf.length * 2);
    for (byte b : buf) {
        int value = b & 0xff; // treat the byte as unsigned
        if (value < 0x10) {
            sb.append('0');   // pad single-digit values to two digits
        }
        sb.append(Integer.toHexString(value));
    }
    return sb.toString();
}

The function below converts the hex string back to bytes; the original string can then be rebuilt from those bytes with the same charset that was used to encode it.

// HexBinaryAdapter belongs to JAXB (javax.xml.bind), which is no longer bundled with the JDK as of Java 11
import javax.xml.bind.annotation.adapters.HexBinaryAdapter;

public static byte[] hexToBytes(String hexString) {
    return new HexBinaryAdapter().unmarshal(hexString);
}

I'm sorry, but this is a very haphazard way of doing things and it would wreck your performance. You lose the ability to do string lookups as well. – parasietje, Jan 17 '12 at 10:50
Yes, I agree with you, but I have explained one way of doing it. If you do not agree, that's OK, but there is no need to -1 my reputation. – Bhavik Ambani, Jan 17 '12 at 12:30

score 0 · Answer 3 · answered Jan 18 '12 at 00:21

If you want to avoid storing troublesome characters (the rare characters outside the Basic Multilingual Plane that are giving you problems), you can walk through the String's code points and discard the String if it contains a code point for which Character.charCount returns 2 or, equivalently, for which Character.isSupplementaryCodePoint returns true.

This way, as you asked, you can avoid storing those strings that (for some reason) your DBMS has trouble with.
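
A minimal sketch of such a filter, using only the `Character` and `String` methods named in this answer. The class name `SupplementaryFilter` and the method name `containsSupplementary` are made up for illustration.

```java
final class SupplementaryFilter {
    /** True if the string contains at least one code point outside the Basic Multilingual Plane. */
    static boolean containsSupplementary(String s) {
        for (int i = 0; i < s.length(); ) {
            int codePoint = s.codePointAt(i); // joins a surrogate pair into one code point
            if (Character.isSupplementaryCodePoint(codePoint)) {
                return true;
            }
            i += Character.charCount(codePoint); // advance by 1 or 2 chars
        }
        return false;
    }

    public static void main(String[] args) {
        String smiley = new String(Character.toChars(0x1F603));
        System.out.println(containsSupplementary("hello"));          // prints false
        System.out.println(containsSupplementary("hello " + smiley)); // prints true
    }
}
```

Note that the loop steps by `Character.charCount(codePoint)` rather than by 1, so the second half of a surrogate pair is never examined on its own.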

Sources: see javadoc for

Character.charCount
Character.isSupplementaryCodePoint

and, while you're at it

String.codePointAt
String.codePointCount

answered Jan 18 '12 at 00:21

Unai Vivi

for (int j = 0; j < text.length(); j++) { int codePoint = text.charAt(j); int count = Character.charCount(codePoint); log.trace("Character count for codePoint = {} is = {}", codePoint, count); } – priya Jan 19 '12 at 04:22
I tried the above code, but none of the count values were greater than 1, not sure if I am doing this correctly. – priya Jan 19 '12 at 04:23
Hmm... try something like `for(int j=0;j1;System.out.println(isBadSymbol1_A+" "+isBadSymbol_B)}` – Unai Vivi Jan 19 '12 at 10:35
the above code doesn't return true for the tweet in https://twitter.com/#!/Sol_Floresita17/status/162857472661524480 – priya Jan 27 '12 at 12:12
How do you import your twitter message into the `text` string? It might be that you're losing information in the process, e.g. passing through a step that doesn't deal with unicode – Unai Vivi Jan 27 '12 at 12:18
I tested this within the context of my twitter client, I didn't run this standalone. So I am pretty sure the string which causes the problem was used by the above code snippet as input. – priya Jan 28 '12 at 04:33
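
One likely reason the loop posted in the comments above never reports a count greater than 1: `text.charAt(j)` returns a single 16-bit `char`, so the value passed to `Character.charCount` is never above 0xFFFF, even when the string does contain a surrogate pair. The sketch below compares that loop with one based on `codePointAt`; the class and method names are hypothetical.

```java
final class CharAtVersusCodePointAt {
    /** Largest charCount seen when reading the string one char at a time. */
    static int maxCountViaCharAt(String text) {
        int max = 0;
        for (int j = 0; j < text.length(); j++) {
            int codeUnit = text.charAt(j); // one UTF-16 code unit, possibly half a surrogate pair
            max = Math.max(max, Character.charCount(codeUnit));
        }
        return max;
    }

    /** Largest charCount seen when reading the string one code point at a time. */
    static int maxCountViaCodePointAt(String text) {
        int max = 0;
        for (int j = 0; j < text.length(); ) {
            int codePoint = text.codePointAt(j); // full code point, surrogate pairs combined
            int count = Character.charCount(codePoint);
            max = Math.max(max, count);
            j += count;
        }
        return max;
    }

    public static void main(String[] args) {
        String text = "tweet " + new String(Character.toChars(0x1F603));
        System.out.println(maxCountViaCharAt(text));      // prints 1
        System.out.println(maxCountViaCodePointAt(text)); // prints 2
    }
}
```

If the `codePointAt` version also reports 1 for the problem tweet, the character was probably already lost or replaced before the string reached this code, which would fit the suggestion above that an earlier step does not handle Unicode.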
