I maintain a Java webapp that runs on Windows but writes to MariaDB on a Linux box. The webapp has a textarea (many, actually) into which users typically paste text from MS Word documents.
An older version of this webapp ran on a small Windows server, writing to MySQL on that same box. This worked, although it had some operational problems.
Since porting the app to Linux, when users paste text containing "special" characters (bullets, em dashes, smart quotes, et cetera) into the forms, the save to MariaDB fails, because some of the bytes are not valid UTF-8.
So I guess I need to implement some sort of character set conversion. The basic idea is straightforward, but getting it exactly right seems to be quite difficult.
I created a small Word doc with the following visual contents:
• This is an item – with a dash
I then pasted this into TextPad and saved the file to disk. I dumped the contents with "od -x", swapped each pair of bytes back into file order (od -x prints 16-bit words, which come out byte-swapped on a little-endian machine), and put them into a byte array, like this:
byte[] data = {(byte)0x95, 0x09, 0x54, 0x68, 0x69, 0x73, 0x20, 0x69, 0x73, 0x20, 0x61, 0x6e, 0x20, 0x69, 0x74, 0x65, 0x6d, 0x20, (byte)0x96, 0x20, 0x77, 0x69, 0x74, 0x68, 0x20, 0x61, 0x20, 0x64, 0x61, 0x73, 0x68, 0x00};
I would expect this to reflect the "windows-1252" encoding, but I really have no idea.
I then attempted to convert this to UTF-8:
byte[] utf8Data = new String(data, "windows-1252").getBytes("UTF-8");
And then I printed the string, which resulted in:
• This is an item – with a dash
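For reference, here is a self-contained version of that test. The class name is just for illustration, and I'm assuming the pasted bytes really are windows-1252; I also dropped the trailing 0x00, which I believe is just od padding the odd-length file out to a full 16-bit word:

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingTest {
    public static void main(String[] args) {
        // Bytes as they appear in the file; 0x95 is the bullet and 0x96 is the
        // en dash in windows-1252.
        byte[] data = {(byte) 0x95, 0x09, 0x54, 0x68, 0x69, 0x73, 0x20, 0x69, 0x73, 0x20,
                0x61, 0x6e, 0x20, 0x69, 0x74, 0x65, 0x6d, 0x20, (byte) 0x96, 0x20,
                0x77, 0x69, 0x74, 0x68, 0x20, 0x61, 0x20, 0x64, 0x61, 0x73, 0x68};

        // Decode the windows-1252 bytes into a Java String (UTF-16 internally)...
        String text = new String(data, Charset.forName("windows-1252"));
        // ...then re-encode that String as UTF-8 for the trip to MariaDB.
        byte[] utf8Data = text.getBytes(StandardCharsets.UTF_8);

        System.out.println(text);
        System.out.println(data.length + " bytes as windows-1252, " + utf8Data.length + " bytes as UTF-8");
    }
}

That prints the bullet and the dash correctly, matching the output above.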
I'm not really sure what I'm doing here, or whether it's even possible to do this completely.
I also need to determine, in the webapp itself, what character encoding to expect for text submitted from forms. I really don't want to just assume it's "windows-1252".
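One idea I'm considering (not sure yet whether it's the right approach) is forcing the request encoding in a servlet filter, so the container doesn't have to guess what the browser sent. A rough sketch of that idea; the class name is a placeholder and the filter would still need to be mapped in web.xml:

import java.io.IOException;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;

// Placeholder filter: decode request parameters as UTF-8 unless the
// browser explicitly declared a different charset.
public class CharacterEncodingFilter implements Filter {

    public void init(FilterConfig config) throws ServletException {
    }

    public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain)
            throws IOException, ServletException {
        if (request.getCharacterEncoding() == null) {
            // Must happen before any call to getParameter(), or it has no effect.
            request.setCharacterEncoding("UTF-8");
        }
        chain.doFilter(request, response);
    }

    public void destroy() {
    }
}

Presumably the pages also need to be served as UTF-8 (and/or the forms given accept-charset="UTF-8") so the browser actually submits UTF-8 in the first place; otherwise I'm back to guessing.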
My brain is hurting.