Checking if string from database is utf-8 encoded in Java

Question

For 2 days now, I've been searching for ways to check whether a value from the database is UTF-8 encoded in Java. So far, I've read that strings in Java use Unicode (UTF-16) encoding. I've tried following the suggested answers from here and here, but neither seems to work properly. The first one always returns false, while the second one always returns true.

Examples of the strings I'm trying to check are as follows; everything except the last string is UTF-8 encoded:

ABCDEF, ｋａｔａｋａｎａ, カタカナ and �K�{�`�F�b�N�G��[

One idea that I've been trying is to get the bytes of the string using utf-8 encoding then also get the bytes of the string using the default encoding then compare like so:

byte[] utf8byte = str.getBytes("UTF-8");
byte[] bytes = str.getBytes();
if(utf8byte.length == bytes.length) {
   return true;
}

However given this logic, only the first string would return true. From my understanding, this is because not all characters use only 1 byte.
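
To illustrate why the length comparison above says more about the charsets being compared than about the string itself, here is a minimal sketch. The class name `ByteLengthDemo` and the helper `encodedLength` are made up for this example, and ISO-8859-1 merely stands in for a non-UTF-8 default charset; the real default depends on the platform.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ByteLengthDemo {
    // Number of bytes the string occupies when encoded with the given charset.
    static int encodedLength(String s, Charset cs) {
        return s.getBytes(cs).length;
    }

    public static void main(String[] args) {
        String ascii = "ABCDEF";
        String katakana = "\u30AB\u30BF\u30AB\u30CA"; // the katakana example string
        for (String s : new String[] {ascii, katakana}) {
            // UTF-8 needs 1 byte per ASCII character but 3 bytes per kana.
            // ISO-8859-1 cannot represent kana, so getBytes substitutes one
            // '?' byte per character and the lengths no longer match.
            System.out.println(s
                    + ": UTF-8=" + encodedLength(s, StandardCharsets.UTF_8)
                    + ", ISO-8859-1=" + encodedLength(s, StandardCharsets.ISO_8859_1)
                    + ", default(" + Charset.defaultCharset() + ")="
                    + encodedLength(s, Charset.defaultCharset()));
        }
    }
}
```

On a machine whose default charset happens to be UTF-8, the original check would return true for every string, so the result depends on where the code runs.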

So what is the best approach you can suggest to check whether a string from the database is UTF-8 encoded or not? I'd really appreciate any ideas. Thanks in advance.

My understanding, and I'm prepared to be shot down by @JonSkeet, is that in general you _can't_ determine the encoding simply by looking at the data in the byte stream. — Tim Biegeleisen, Nov 11 '15 at 05:27
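
What can be checked, assuming the raw bytes are available at all (which is not the case once the driver has already produced a Java String), is whether a byte sequence is well-formed UTF-8. The sketch below uses a strict `CharsetDecoder`; the class and method names are invented for illustration. A positive result does not prove the data was meant as UTF-8, since plain ASCII, for example, always passes.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class Utf8Check {
    // Returns true if the bytes form a well-formed UTF-8 sequence.
    // Well-formed does not mean "was written as UTF-8": other encodings
    // can produce byte sequences that also happen to be valid UTF-8.
    static boolean isWellFormedUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    // Fail instead of silently substituting U+FFFD.
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```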

score 3 · Answer 1 · answered Nov 11 '15 at 05:52

You can't.

The Java database driver reads the encoded byte string from the database and converts it to a Java string. The database may choose to send the string as UTF-8, UTF-16 or any other encoding the driver understands.

Once it's a Java string it no longer contains any traces of the original encoding. getBytes() will use your system character encoding to encode the string. It has no relation to the database encoding.

Yes, Java uses UTF-16 under the hood but it's irrelevant.
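
A small sketch of that point, with made-up names (`EncodingErasedDemo`, `fromUtf8`, `fromUtf16`): the byte arrays stand in for whatever the database sends over the wire, and the `new String(...)` calls stand in for the decoding a driver would do. The two results are indistinguishable.

```java
import java.nio.charset.StandardCharsets;

class EncodingErasedDemo {
    static final String TEXT = "\u30AB\u30BF\u30AB\u30CA"; // katakana sample

    // Simulates a driver decoding a value that was sent as UTF-8.
    static String fromUtf8() {
        byte[] wire = TEXT.getBytes(StandardCharsets.UTF_8);
        return new String(wire, StandardCharsets.UTF_8);
    }

    // Simulates a driver decoding the same value sent as UTF-16 (big-endian).
    static String fromUtf16() {
        byte[] wire = TEXT.getBytes(StandardCharsets.UTF_16BE);
        return new String(wire, StandardCharsets.UTF_16BE);
    }

    public static void main(String[] args) {
        // The wire formats differ (12 bytes vs 8 bytes), yet the resulting
        // Strings are equal and carry no record of how they arrived.
        System.out.println(fromUtf8().equals(fromUtf16())); // prints "true"
    }
}
```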

Alastair McCormack


Not sure I understand fully, so... how about if I check the encoding of the column value directly on the database? I'm using Oracle DB. Would that be possible? – user1597438 Nov 11 '15 at 05:58
Perhaps you should explain why you need to check the encoding – Alastair McCormack Nov 11 '15 at 06:13
Without divulging too many details, we're transferring our current data to a new system and database. What I'm doing is simply a "checker" to ensure that all our data is transferred correctly, including ensuring that the data is correctly encoded as UTF-8 (much of the data is in kanji and kana characters). – user1597438 Nov 11 '15 at 06:22
Assuming you've created your tables or columns with UTF-8 encoding, then all you need to do is to do a straight String compare of each row of data from the two databases. – Alastair McCormack Nov 11 '15 at 06:38
In this case, a string comparison of the old version and the new version is sufficient; if they did not convert correctly, they would not be legible to Java. – Steve K Nov 11 '15 at 06:38
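
The comparison suggested in the comments could be sketched roughly as follows. `MigrationCheck` and both method names are hypothetical, and how the old and new values are fetched is left out. The extra test for U+FFFD is an addition, not something the commenters proposed: decoders substitute that character for bytes they could not decode, and it appears in the garbled sample string from the question.

```java
class MigrationCheck {
    // U+FFFD is the replacement character that decoders emit for
    // bytes they could not interpret in the expected encoding.
    static boolean hasReplacementChar(String s) {
        return s.indexOf('\uFFFD') >= 0;
    }

    // A value passes if old and new are equal and the new value shows
    // no sign of decoding damage. Two NULLs count as a match.
    static boolean transferredCorrectly(String oldValue, String newValue) {
        if (oldValue == null || newValue == null) {
            return oldValue == newValue;
        }
        return oldValue.equals(newValue) && !hasReplacementChar(newValue);
    }
}
```

If legitimate data can contain U+FFFD, a hit from `hasReplacementChar` is better treated as a row to review by hand than as a definite failure.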
