Properly convert text from MS Word ("windows-1252"?) to utf-8 in Java

Question

I maintain a Java webapp that runs on Windows, but which writes to MariaDB on a Linux box. The webapp has a textarea (many, actually) into which users typically paste text from an MS Word doc.

An older version of this webapp ran on a small Windows server, writing to MySQL on that box. This worked, although it had some operational problems.

Since I ported the app to Linux, when users paste text containing "special" characters (bullets, em dashes, smart quotes, et cetera) into forms, the save to MariaDB fails, as some of the characters are not valid in UTF-8.

So, I guess I need to implement some sort of character set conversion. The basic idea is straightforward, but getting it exactly right seems to be quite difficult.

I created a small Word doc with the following visual contents:

•   This is an item – with a dash

I then tried pasting this into Textpad and saved that file to disk. I dumped the contents with "od -x", then reversed the bytes, and then put them into a byte array, like this:

byte[] data = {(byte)0x95, 0x09, 0x54, 0x68, 0x69, 0x73, 0x20, 0x69, 0x73, 0x20, 0x61, 0x6e, 0x20, 0x69, 0x74, 0x65, 0x6d, 0x20, (byte)0x96, 0x20, 0x77, 0x69, 0x74, 0x68, 0x20, 0x61, 0x20, 0x64, 0x61, 0x73, 0x68, 0x00};

I would expect this reflects "windows-1252" encoding, but I really have no idea.

I then attempted to convert this to UTF-8:

byte[] utf8Data = new String(data, "windows-1252").getBytes("UTF-8");

And then I printed the string, which resulted in:

â€¢ This is an item â€“ with a dash

I'm not really sure what I'm doing here, or whether it's even possible to do this completely.
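One way to narrow this down is to inspect code points and raw bytes instead of printed glyphs, since the console's own encoding can distort what gets displayed. The following is only a minimal sketch under that assumption: the class and method names (`Cp1252RoundTrip`, `decodeCp1252`, `misreadUtf8AsCp1252`) are hypothetical, and it uses just the first few bytes of the dump above.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; the names are illustrative only.
class Cp1252RoundTrip {
    static final Charset CP1252 = Charset.forName("windows-1252");

    // Decode windows-1252 bytes into a Java String.
    static String decodeCp1252(byte[] raw) {
        return new String(raw, CP1252);
    }

    // Encode the text as UTF-8, then wrongly decode those bytes as
    // windows-1252, which is what a mis-configured console or viewer does.
    static String misreadUtf8AsCp1252(String text) {
        return new String(text.getBytes(StandardCharsets.UTF_8), CP1252);
    }

    public static void main(String[] args) {
        // First bytes of the dump above: bullet, tab, "This".
        byte[] data = {(byte) 0x95, 0x09, 0x54, 0x68, 0x69, 0x73};
        String text = decodeCp1252(data);

        // Print code points rather than glyphs so the console's
        // encoding cannot distort the result.
        for (int i = 0; i < text.length(); i++) {
            System.out.printf("U+%04X%n", (int) text.charAt(i));
        }

        // Print the UTF-8 bytes of the first character as hex.
        for (byte b : text.substring(0, 1).getBytes(StandardCharsets.UTF_8)) {
            System.out.printf("%02X ", b & 0xFF);
        }
        System.out.println();
    }
}
```

If the first code point printed is U+2022 (the bullet), the decode step itself is probably fine, and the garbled output would point at how the UTF-8 bytes are displayed or stored afterwards rather than at the conversion.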

I also need to determine, in the webapp itself, what character encoding I should expect text in forms to use. I really don't want to assume it's "windows-1252".

My brain is hurting.

Possible duplicate: https://stackoverflow.com/questions/23082522/java-convert-windows-1252-to-utf-8-some-letters-are-wrong — Jacob B., Nov 30 '17 at 19:53
To the extent that it's talking about conversions from perhaps the same encoding to utf-8, that would be true, but otherwise that doesn't help me. — David M. Karr, Nov 30 '17 at 20:01
I understand. If I didn't respond to it, someone might think I didn't investigate the possibility that it might solve my problem. — David M. Karr, Nov 30 '17 at 20:06
You'll have to learn the basics first — here's [an introduction on Joel on software](https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/). Then you'll have to make sure the text encoding is correctly declared in MariaDB and your web server. — roeland, Nov 30 '17 at 21:12
I really already know the basics. I read Joel's doc, and I've read it before, but I didn't really learn anything new. — David M. Karr, Nov 30 '17 at 22:46
"*some of the characters are not valid in UTF-8*" - this is not true at all. ALL Unicode characters can be encoded in UTF-8. You just have to make sure that you take the encoding of your web UI into account and convert the text to UTF-8 properly (if it is not already) *before* giving it to MariaDB. Be explicit about your webform's charset in your HTML, don't rely on the browser to decide (and if it does, be sure to pay attention to the charset reported in the HTTP post, per the HTML specs). The HTML `
<form>` element has an `accept-charset` attribute. — Remy Lebeau, Dec 01 '17 at 00:14
`â€¢ This is an item â€“ with a dash` is what happens when the (correct) UTF-8 encoded form of `• This is an item – with a dash` is *interpreted* as Windows-1252 or a similar charset instead of as UTF-8. — Remy Lebeau, Dec 01 '17 at 00:16
So what impact does the "accept-charset" attribute actually have? The docs I see just say that that is the charset it will use. So what does that mean? If I set accept-charset to "utf-8" and then the user pastes text from a word doc with special chars in it, what does that do? — David M. Karr, Dec 01 '17 at 22:44
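
As a rough illustration of the `accept-charset` question: text pasted into a browser textarea is held as Unicode characters, not as windows-1252 bytes, and a form submitted as UTF-8 sends the UTF-8 bytes of those characters in percent-encoded form. The sketch below only mimics that with the JDK's `URLEncoder` and `URLDecoder`. The class and method names are hypothetical, and a real servlet container does the decoding itself using whatever request encoding is configured.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

// Hypothetical sketch; not part of any servlet API.
class FormCharsetSketch {

    // Roughly what a browser does for a form submitted as UTF-8:
    // percent-encode the UTF-8 bytes of the field value.
    static String encodeAsBrowser(String fieldValue) {
        try {
            return URLEncoder.encode(fieldValue, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    // Roughly what the server side does: turn the percent-encoded
    // bytes back into text using the charset it believes was used.
    static String decodeAsServer(String wireValue, String charsetName) {
        try {
            return URLDecoder.decode(wireValue, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String pasted = "\u2022 item"; // bullet followed by plain text
        String wire = encodeAsBrowser(pasted);
        System.out.println(wire);

        // Decoding with the matching charset restores the bullet.
        String ok = decodeAsServer(wire, "UTF-8");
        System.out.printf("U+%04X%n", (int) ok.charAt(0));

        // Decoding with the wrong charset turns one character into three.
        String wrong = decodeAsServer(wire, "windows-1252");
        System.out.println(wrong.length() - ok.length());
    }
}
```

Under these assumptions, the charset that matters is the one the server uses to decode the request. If it matches what the form sent, the pasted Word characters arrive intact and no windows-1252 conversion step is needed.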
