In Java, `String` uses UTF-16 internally:
If you have a plain Java `String`, you do not need to do anything: your JDBC driver will transparently convert it to whatever encoding the database uses when you bind it as a `String` in your insert statement, and `ResultSet.getString()` will transparently give it back to you as a Java `String` when you read.

If that is not what you observe, then something in the application is misconfigured and is inserting bad data that is not in the encoding it claims to be. Garbage In/Garbage Out.
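For illustration, here is a minimal sketch of that round trip, assuming a hypothetical `notes` table with a text column `body` and an already-open `java.sql.Connection`; note that no encoding work appears anywhere:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

class JdbcStringExample {
    // Hypothetical table "notes" with a text column "body"; the driver
    // handles all charset conversion between Java and the database.
    static void roundTrip(Connection connection) throws SQLException {
        try (PreparedStatement insert = connection.prepareStatement(
                "INSERT INTO notes (body) VALUES (?)")) {
            insert.setString(1, "Grüße, 世界"); // pass the Java String as-is, no manual encoding
            insert.executeUpdate();
        }
        try (PreparedStatement query = connection.prepareStatement("SELECT body FROM notes");
             ResultSet rs = query.executeQuery()) {
            while (rs.next()) {
                System.out.println(rs.getString("body")); // comes back as a Java String, already decoded
            }
        }
    }
}
```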
When you need to worry about encoding/decoding:
You only have to worry about translating `byte[]` encodings when reading or writing textual data to files or sockets that only accept `byte[]`. When working with a `byte[]` that represents text, use `new String(bytes, charset)` and `byte[] b = string.getBytes(charset);` respectively, specifying whatever encoding the source data is coming in as and the destination data needs to go out as.
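A short sketch of that `byte[]` round trip through a file, assuming the data on disk is meant to be UTF-8 (the file name is just an example):

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class ExplicitCharsetExample {
    public static void main(String[] args) throws IOException {
        Path file = Path.of("greeting.txt");          // hypothetical file name

        // Encoding: String -> byte[] with an explicit Charset.
        byte[] encoded = "Grüße, 世界".getBytes(StandardCharsets.UTF_8);
        Files.write(file, encoded);

        // Decoding: byte[] -> String with the same explicit Charset.
        byte[] raw = Files.readAllBytes(file);
        String decoded = new String(raw, StandardCharsets.UTF_8);
        System.out.println(decoded);                  // prints: Grüße, 世界
    }
}
```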
Never rely on the default encoding:
Never use `new String(byte[])` or `.getBytes()`: they use the platform default encoding, and what that is can be a crapshoot because of all the ways it can vary that are opaque to your code.

The subtle issue is that `UTF-8`, `Windows-1252`, and a couple of other encodings are supersets of `ASCII` and overlap each other in that range as well. So if you use the default encoding, everything might look like it is working fine, and then things blow up when you ingest or export some `byte[]` that contains characters outside the `ASCII` range.
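A small demonstration of why the overlap hides the problem: decoding UTF-8 bytes as Windows-1252 looks fine for pure ASCII and only falls apart on the non-ASCII characters.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class DefaultEncodingPitfall {
    public static void main(String[] args) {
        Charset windows1252 = Charset.forName("windows-1252");

        // Pure ASCII text: both encodings produce identical bytes, so the bug stays hidden.
        byte[] ascii = "hello".getBytes(StandardCharsets.UTF_8);
        System.out.println(new String(ascii, windows1252));        // hello

        // Non-ASCII text: UTF-8 encodes 'é' as two bytes, Windows-1252 reads them as two chars.
        byte[] nonAscii = "café".getBytes(StandardCharsets.UTF_8);
        System.out.println(new String(nonAscii, windows1252));     // cafÃ© (mojibake)
    }
}
```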
In Summary:
- Never use `byte[]` to represent text unless some API requires you to.
- Never rely on the default encoding, even if you think you know what it is.
- Always specify the `Charset` when converting from `byte[]` or to `byte[]`.
- Never conflate or confuse `Charset` encoding with URL/URI/HTML/XML escaping (the two are contrasted in the sketch after this list).
- Unicode is not an encoding.
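To illustrate that last distinction, here is a small sketch contrasting the two layers; the example string is arbitrary, and the hex dump is only there to make the raw bytes visible:

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.HexFormat;

class CharsetVsUrlEscaping {
    public static void main(String[] args) {
        String text = "café & cream";

        // Charset encoding: characters -> raw bytes (here UTF-8).
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        System.out.println(HexFormat.of().formatHex(utf8)); // 636166c3a9202620637265616d

        // URL escaping: a separate, later step that percent-escapes reserved and
        // non-ASCII characters so the text can be embedded in a URL.
        String escaped = URLEncoder.encode(text, StandardCharsets.UTF_8);
        System.out.println(escaped);                         // caf%C3%A9+%26+cream
    }
}
```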