
We ran some Java code from cron on Linux to persist thousands of records to a production database. The locale charmap on that box was "ANSI_X3.4-1968". We took the following steps before persisting the records:

  1. Use StringEscapeUtils.unescapeHtml4 on the text
  2. Write the String in UTF-8 format and persist it in the database

The problem is that after these steps, special characters show up as "?". Is it possible to recover the original characters? I have simulated the problem with the following steps.

  1. Change Eclipse encoding to "ANSI_X3.4-1968"
  2. Write the following lines of code

String insertSpecial = StringEscapeUtils.unescapeHtml4("×");
System.out.println(insertSpecial);
String uni = new String(insertSpecial.getBytes(), "UTF-8");// This value is currently in DB
System.out.println(uni);

Now I want to get back "×" from the String "uni". Any help will be appreciated.

1 Answer


Basically no. The mistake is in `new String(insertSpecial.getBytes(), "UTF-8");`, which shows once again that character encoding is surprisingly difficult to handle.

What that piece of code does, step by step:

  1. Get the bytes of insertSpecial, encoded in the platform's default encoding
  2. Create a new String from those bytes, claiming that they are UTF-8 (even though they were just produced in the platform encoding)
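The two steps above can be reproduced without changing any IDE or locale setting. The following is a minimal sketch, assuming the platform encoding was US-ASCII (ANSI_X3.4-1968 is another name for that charset); the class and method names are made up for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class LossyRoundTrip {
    // Step 1: encode with US-ASCII, standing in for the platform default
    // on the cron box (an assumption made for this sketch).
    static byte[] encodeAsPlatform(String s) {
        return s.getBytes(StandardCharsets.US_ASCII);
    }

    // Step 2: decode those same bytes as if they were UTF-8.
    static String decodeAsUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "\u00D7"; // the multiplication sign from the question
        byte[] bytes = encodeAsPlatform(original);
        System.out.println(Arrays.toString(bytes)); // [63], the byte for '?'
        System.out.println(decodeAsUtf8(bytes));    // ?
    }
}
```

The only byte that ever exists is 63, so nothing in the resulting String still identifies the original character.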

I've seen this code several times, and unfortunately it only breaks things. It's completely unnecessary: even if it were written correctly, with the same encoding on both sides, it wouldn't "convert" anything. If the platform encoding is not UTF-8, it will most likely destroy any special characters (or even the whole String, if the platform encoding and the one given to the String constructor differ enough).
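To illustrate why the round trip converts nothing even when written correctly, here is a small sketch with hypothetical names. A Java String has no encoding of its own; a charset only matters when the String is turned into bytes or bytes are turned back into a String, and that is where it should be named explicitly.

```java
import java.nio.charset.StandardCharsets;

class ExplicitCharset {
    // Encoding and decoding with the same charset gives back an equal String,
    // so this "conversion" is a no-op.
    static String roundTrip(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String text = "\u00D7";
        System.out.println(roundTrip(text).equals(text));                 // true
        // The charset matters only here, where bytes are actually produced.
        System.out.println(text.getBytes(StandardCharsets.UTF_8).length); // 2
    }
}
```

Naming the charset at the point where bytes are written (to a file, socket, or database driver) makes the result independent of the platform default.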

The question mark is a placeholder for a character that could not be converted. It was substituted when the bytes were produced, so the original character is gone for good.

Here's some reading so you won't make that mistake again: Joel Spolsky's "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)"

Kayaman
  • Basically the problem was the cron job. When we tested manually, the encoding was "UTF-8" and things worked fine. We were not aware that a job run from cron picks up a different default encoding. We only found the problem after the scripts had run, and we cannot recover the data because we no longer have the input texts. – Buddha Chattopadhyay Aug 11 '16 at 08:56
  • Well, the root problem was not understanding encoding. I've seen the exact same `new String(insertSpecial.getBytes(), "UTF-8");` line several times before, and I'm wondering where you came up with it. It can never work, so why are so many people trying it? – Kayaman Aug 11 '16 at 09:16
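Since the job behaved differently under cron than in a manual run, one defensive option is to check the default charset at startup and stop before any data is written. This is only a sketch under that assumption; `CharsetGuard` and `requireUtf8` are hypothetical names, and how the cron environment itself gets fixed is outside its scope.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetGuard {
    // Throws if the given charset is not UTF-8, so a misconfigured
    // environment fails fast instead of silently writing '?'.
    static void requireUtf8(Charset actual) {
        if (!StandardCharsets.UTF_8.equals(actual)) {
            throw new IllegalStateException(
                "Expected UTF-8 as default charset but found " + actual.name());
        }
    }

    public static void main(String[] args) {
        // Charset.defaultCharset() reports what the JVM picked up from its environment.
        requireUtf8(Charset.defaultCharset());
        System.out.println("Default charset is UTF-8");
    }
}
```

Code that always passes an explicit charset does not depend on this check, but it catches libraries or legacy calls that still rely on the default.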