
I'm working on a website that uses UTF-8 encoding. The server side is developed in Java, and the database uses Windows-1252 encoding.

How do I encode the characters correctly so they can be displayed correctly both in the database viewer and on the client side?

EDIT

Here is the code:

Class.forName("com.pervasive.jdbc.v2.Driver");

Connection conn = DriverManager.getConnection(
        "jdbc:pervasive://XXX.XXX.XXX.XXX/TEST", "xxxx", "xxxxx");

Statement stmt = conn.createStatement();

String sql = "INSERT INTO MyTest (COL1, COL2) VALUES (99999, 'Ó 456789 ñÑ; ° - + ( _ . - / \\ & <' )";
stmt.executeUpdate(sql);

The database viewer shows ? 456789 ??; ? - + ( _ . - / \\ & < instead of Ó 456789 ñÑ; ° - + ( _ . - / \\ & <.

The same string, ? 456789 ??; ? - + ( _ . - / \\ & <, is returned when a SELECT is executed.
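For what it's worth, the '?' substitution can be reproduced in plain Java, without any database involved. This is only a sketch using Java's charset API (not the Pervasive driver): encoding text through a charset that cannot represent it replaces each unmappable character with '?', which is exactly the symptom above:

```java
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    public static void main(String[] args) {
        String original = "Ó 456789 ñÑ; °";

        // US-ASCII cannot represent Ó, ñ, Ñ or °; getBytes() replaces
        // each unmappable character with '?' (0x3F).
        byte[] bytes = original.getBytes(StandardCharsets.US_ASCII);
        String garbled = new String(bytes, StandardCharsets.US_ASCII);

        System.out.println(garbled); // ? 456789 ??; ?
    }
}
```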

Fernando Prieto
  • From personal experience, there's not really a good way to do this without some sort of loss. One possible workaround is to store HTML-entity-encoded text in the database (such as &copy; or &euro; for the copyright and euro symbols), expect those entities to show up literally in the database viewer, and convert them to and from the necessary formats in code on the Java side. It's gross, but it highlights the need for Unicode-compliant database technology. – Shotgun Ninja Sep 10 '15 at 17:38
  • @ShotgunNinja you are confusing and conflating HTML `escaping` with character-set `byte[]` encoding, which are two completely different issues. – Sep 10 '15 at 18:13
  • 1
    No, I'm not. The goal here is to have displayable characters/code points working on both ends, some of which are outside of the representable range of Windows-1252, if I've understood correctly. While Unicode was designed for this in mind, it isn't allowed in Pervasive SQL servers before the most recent Version 12, therefore we need to come up with a different solution that will work in the DB viewer as well as in the Java and on the web. HTML entity encoding of non-Windows-1252-represented code points is a **workaround**, not a **solution** to this problem. – Shotgun Ninja Sep 10 '15 at 18:17
  • I think the problem is compounded by the fact that neither Oracle pre-10g nor Pervasive pre-v12 has a *decent* JDBC implementation that properly handles situations where the underlying data is nasty (i.e. represented by a single-byte code page but containing multi-byte characters, as was the case in my previous experience). – Shotgun Ninja Sep 10 '15 at 18:20
  • You are confusing `Unicode`, which is a code point specification and not an encoding, with encodings: `UTF-8`, `UTF-16`, `UTF-32` and `Windows-1252` are encodings that map the same `Unicode` code points (or a subset of them) to different byte sequences. How text displays is a problem of the client that interprets and renders it, not of the source of the data. – Sep 10 '15 at 18:25
  • No, I'm not! I'm simply suggesting another "works poorly everywhere" alternative to properly handling character encodings in Java during the process of interacting with the database. It's not ideal, obviously, but it gets the work done if the actual Java code can't be touched, but the data can be. – Shotgun Ninja Sep 10 '15 at 18:27
  • This is a problem with the client software using the wrong encoding, as I state in my answer, simple as that. But you never once tell us what client software package you are using, nor show us that code or the DB server configuration, so it is impossible to do anything but speculate. – Sep 11 '15 at 02:53

1 Answer


In Java, String uses UTF-16 internally:

If you have a plain Java String, you do not need to do anything: the JDBC driver transparently converts the String to whatever encoding the database uses when you insert it in your INSERT statement.

And when you read it back with ResultSet.getString(), it transparently gives you a Java String.

If that is not what happens, then something in the application is configured incorrectly and is inserting bad data that is not in the encoding it claims to be. Garbage in, garbage out.
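As a sanity check, here is a minimal sketch reusing the test value from the question (standard library only, no JDBC): every character in that value exists in Windows-1252, so the data itself round-trips losslessly when the right charset is used. If '?' comes back, the configuration is at fault, not the data:

```java
import java.nio.charset.Charset;

public class RoundTripCheck {
    public static void main(String[] args) {
        Charset cp1252 = Charset.forName("windows-1252");
        String value = "Ó 456789 ñÑ; ° - + ( _ . - / \\ & <";

        // Encode to Windows-1252 bytes and decode them again; since every
        // character here has a Windows-1252 code point, nothing is lost.
        String roundTrip = new String(value.getBytes(cp1252), cp1252);

        System.out.println(roundTrip.equals(value)); // true
    }
}
```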

When you need to worry about encoding/decoding:

You only have to worry about translating byte[] encodings when reading/writing textual data to files or sockets that only accept byte[].

When working with a byte[] that represents text, use new String(bytes, charset) to decode and byte[] b = string.getBytes(charset); to encode, specifying in each case the encoding the source bytes are actually in or the destination bytes need to go out as.
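A short illustration of those two calls with explicit charsets (standard library only; the string value is just an example):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ExplicitCharsets {
    public static void main(String[] args) {
        String text = "Ó ñÑ °";

        // Encode: String -> byte[], naming the target encoding.
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        byte[] cp1252Bytes = text.getBytes(Charset.forName("windows-1252"));

        // The same text has different byte lengths per encoding: each
        // non-ASCII character takes two bytes in UTF-8, one in Windows-1252.
        System.out.println(utf8Bytes.length);   // 10
        System.out.println(cp1252Bytes.length); // 6

        // Decode: byte[] -> String, naming the encoding the bytes are in.
        String decoded = new String(utf8Bytes, StandardCharsets.UTF_8);
        System.out.println(decoded.equals(text)); // true
    }
}
```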

Never rely on the default encoding:

Never use new String(byte[]) or .getBytes(), which fall back to the platform default encoding; what you get is then a crap shoot, because the default can vary in ways that are opaque to your code.

The subtle issue is that UTF-8, Windows-1252 and a couple of other encodings are supersets of ASCII, so they overlap each other in that range. If you use the default encoding, everything may look like it is working fine, and then things blow up when you ingest or export a byte[] that contains characters outside the ASCII range.
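That failure mode can be sketched in a few lines. Here the charset mismatch is forced explicitly rather than relying on the platform default, but the effect is the same: the ASCII prefix survives and only the non-ASCII tail turns to mojibake:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingMismatch {
    public static void main(String[] args) {
        String text = "abcÓ";

        // Ó (U+00D3) becomes the two bytes 0xC3 0x93 in UTF-8.
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);

        // Decoding those bytes with the wrong charset leaves the ASCII
        // part intact but mangles everything beyond it.
        String wrong = new String(utf8Bytes, Charset.forName("windows-1252"));

        System.out.println(wrong); // abcÃ“
    }
}
```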

In Summary:

  1. Never use byte[] to represent text unless some API requires you to.
  2. Never rely on the default encoding, even if you think you know what it is.
  3. Always specify the Charset when converting from byte[] or to byte[].
  4. Never conflate or confuse Charset encoding with URL/URI/HTML/XML escaping.
  5. Unicode is not an encoding.