Refer to this tweet and the following thread, where we are trying to store a similar tweet in the database. I am unable to store this tweet in MySQL. I would like to know how to identify whether a string contains a character that cannot be processed by the utf8mb4 character set, so that I can avoid storing it.
-
You misunderstood something: `utf8mb4` can store all Unicode characters currently supported. Reread the referred question. – axtavt Jan 09 '12 at 09:44
-
It still isn't working for me with utf8mb4. What should I do? – priya Jan 09 '12 at 12:45
-
appreciate any inputs on identifying the character. – priya Jan 10 '12 at 03:38
-
Are you sure that the problem is with MySQL? Maybe it's with the MySQL driver or some such. – cha0site Jan 11 '12 at 06:53
-
I've pretty much tried all options as far as MySQL is concerned. Since I am unable to store the tweet, I would like to know a way to find such strings so that I can avoid storing them. – priya Jan 11 '12 at 08:28
-
What does MySQL say it is willing to store in such strings? Just 8 bit ASCII codes? In that case, the test is easy. If MySQL is willing to store Unicode, you shouldn't have a problem. If it stores something else... Unicode defines a wide variety of character classes, and some tools (we have one but it isn't easily accessed from MySQL environments) implement corresponding predicates, so it is possible to decide for any character code if it belongs to such Unicode classes. – Ira Baxter Jan 11 '12 at 12:09
-
Are your table's default character set and text fields set to utf8mb4? – Zack Macomber Jan 11 '12 at 13:56
-
@ZackMacomber - Yes default character set and text fields are set to utf8mb4. – priya Jan 16 '12 at 11:07
3 Answers
The character that poses a problem for you is U+1F603 SMILING FACE WITH OPEN MOUTH, which has a value not representable in 16 bits. When converted to UTF-8 the byte values are `f0 9f 98 83`, which should fit without issues in a `utf8mb4` character set MySQL column, so I will agree with the other commenters that it doesn't look to be a MySQL issue. If you can attempt to re-insert this tweet, log all SQL statements as received by MySQL to determine if the characters get corrupted before or after sending them to MySQL.
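To make the byte values concrete, here is a minimal Java sketch that encodes U+1F603 and prints its UTF-8 bytes. The class name `SmileyBytes` and the helper `utf8Hex` are made up for this illustration and are not part of the original answer.

```java
import java.nio.charset.StandardCharsets;

class SmileyBytes {
    // Hypothetical helper: UTF-8 bytes of one code point as space-separated lowercase hex.
    static String utf8Hex(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        int smiley = 0x1F603;
        String s = new String(Character.toChars(smiley));
        // In Java's UTF-16 strings the character occupies two chars (a surrogate pair).
        System.out.println(s.length());       // prints 2
        System.out.println(utf8Hex(smiley));  // prints f0 9f 98 83
    }
}
```

Comparing this output with the bytes that appear in the MySQL query log is one way to see at which stage the character is altered.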

-
I tried storing this tweet in MySQL with a utf8mb4 character set, but it seems to fail and I am unable to fix that issue. Hence I would like to check at the application level whether a string contains such a character, so that I can avoid storing such strings. – priya Jan 16 '12 at 11:13
-
We understand that this is your diagnostic, but we think it's wrong; therefore it would help if you could add more details to support or refute your root cause analysis. Do you get an error message? Can you post the generated SQL, as requested? – tripleee Jan 17 '12 at 07:45
-
priya, if you want to simply check the tweets, the way is easy - check to see if any character in the tweet has a UTF-8 representation larger than 3 bytes. However, as @tripleee mentions, we believe that MySQL probably is not at fault here. – Tassos Bassoukos Jan 18 '12 at 14:38
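A check for UTF-8 representations longer than 3 bytes could look roughly like the sketch below. It relies on the fact that the lead byte of every 4-byte UTF-8 sequence has the bit pattern `11110xxx`. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

class FourByteCheck {
    // Hypothetical helper: true if the UTF-8 encoding of text contains a 4-byte sequence.
    static boolean hasFourByteUtf8(String text) {
        for (byte b : text.getBytes(StandardCharsets.UTF_8)) {
            // Lead bytes of 4-byte sequences match the bit pattern 11110xxx.
            if ((b & 0xF8) == 0xF0) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // "\uD83D\uDE03" is the surrogate pair for U+1F603.
        System.out.println(hasFourByteUtf8("hello \uD83D\uDE03"));
        System.out.println(hasFourByteUtf8("hello"));
    }
}
```

Note that `String.getBytes` replaces unpaired surrogates with a substitution byte, so a malformed string would not be flagged by this check.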
Instead of finding the special character in the string, you can convert the string to hex format, and later convert it back to the original string:

    // Encodes each byte as two lowercase hex digits.
    public static String toHex(byte[] buf) {
        StringBuilder strbuf = new StringBuilder(buf.length * 2);
        for (byte b : buf) {
            int value = b & 0xff; // treat the byte as unsigned
            if (value < 0x10) {
                strbuf.append('0'); // pad single-digit values
            }
            strbuf.append(Integer.toString(value, 16));
        }
        return strbuf.toString();
    }

By using the function below you can convert the hex back to the original bytes (and from those, the original string):

    // Requires: import javax.xml.bind.annotation.adapters.HexBinaryAdapter;
    // (JAXB; bundled with the JDK through Java 8, removed in Java 11)
    public static byte[] hexToBytes(String hexString) {
        return new HexBinaryAdapter().unmarshal(hexString);
    }
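As an illustration of the full round trip from `String` to hex and back, here is a self-contained sketch. It is not the answerer's code: it avoids the JAXB dependency by decoding the hex by hand, it assumes UTF-8 as the intermediate encoding, and the names `HexRoundTrip`, `encode` and `decode` are invented for the example.

```java
import java.nio.charset.StandardCharsets;

class HexRoundTrip {
    // Hypothetical helper: String -> UTF-8 bytes -> lowercase hex.
    static String encode(String text) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(StandardCharsets.UTF_8)) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    // Hypothetical helper: hex -> UTF-8 bytes -> String.
    // Assumes an even-length string of valid hex digits.
    static String decode(String hex) {
        byte[] bytes = new byte[hex.length() / 2];
        for (int i = 0; i < bytes.length; i++) {
            bytes[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
        }
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "smile \uD83D\uDE03";
        String hex = encode(original);
        System.out.println(hex);
        System.out.println(decode(hex).equals(original)); // prints true
    }
}
```

The stored value contains only ASCII hex digits, so it is twice as long as the UTF-8 bytes and can no longer be searched as text in the database.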

-
I'm sorry, but this is a very haphazard way of doing stuff which would explode your performance. You lose the possibility to do string lookups as well. – parasietje Jan 17 '12 at 10:50
-
Yes, I agree with you, but I have explained one way of doing it. If you don't agree, that's OK, but there is no need to -1 the reputation. – Bhavik Ambani Jan 17 '12 at 12:30
If you want to avoid storing troublesome characters (the rare fancy characters outside the Basic Multilingual Plane, that give you problems), you can parse the `String`'s characters and discard the `String` if it contains code points for which `Character.charCount` returns `2`, or for which `Character.isSupplementaryCodePoint` returns `true`.
This way, as you asked, you can avoid storing those strings that (for some reason) your DBMS has trouble with.
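A minimal sketch of such a check follows; the class and method names are hypothetical. It walks the string by code point with `String.codePointAt` and advances by `Character.charCount`, so each surrogate pair is examined as a single code point.

```java
class SupplementaryCheck {
    // Hypothetical helper: true if text contains a code point outside the BMP.
    static boolean containsSupplementary(String text) {
        for (int i = 0; i < text.length(); ) {
            int codePoint = text.codePointAt(i);
            if (Character.isSupplementaryCodePoint(codePoint)) {
                return true;
            }
            // Advance by 1 char for BMP code points, 2 for supplementary ones.
            i += Character.charCount(codePoint);
        }
        return false;
    }

    public static void main(String[] args) {
        // "\uD83D\uDE03" is the surrogate pair for U+1F603.
        System.out.println(containsSupplementary("hi \uD83D\uDE03")); // prints true
        System.out.println(containsSupplementary("hi"));              // prints false
    }
}
```

Iterating with `String.charAt` instead would not work: it returns individual UTF-16 code units, each of which is at most 0xFFFF, so `Character.charCount` returns 1 for every one of them.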
Sources: see the javadoc for `Character.charCount` and `Character.isSupplementaryCodePoint` and, while you're at it, `String.codePointAt` and `String.codePointCount`.

-
`for (int j = 0; j < text.length(); j++) { int codePoint = text.charAt(j); int count = Character.charCount(codePoint); log.trace("Character count for codePoint = {} is = {}", codePoint, count); }` – priya Jan 19 '12 at 04:22
-
I tried the above code, but none of the count values were greater than 1, not sure if I am doing this correctly. – priya Jan 19 '12 at 04:23
-
Hmm... try something like `for(int j=0;j [...] 1;System.out.println(isBadSymbol1_A+" "+isBadSymbol_B)}` – Unai Vivi Jan 19 '12 at 10:35
-
the above code doesn't return true for the tweet in https://twitter.com/#!/Sol_Floresita17/status/162857472661524480 – priya Jan 27 '12 at 12:12
-
How do you import your twitter message into the `text` string? It might be that you're losing information in the process, e.g. passing through a step that doesn't deal with unicode – Unai Vivi Jan 27 '12 at 12:18
-
I tested this within the context of my twitter client, I didn't run this standalone. So I am pretty sure the string which causes the problem was used by the above code snippet as input. – priya Jan 28 '12 at 04:33