Encoding issues

Question

I have a "windows-1255" encoded String. Is there any safe way I can convert it to a "UTF-8" String and vice versa?

In general, is there a safe way (meaning data will not be damaged) to convert between encodings in Java?

     byte[] bytes = str.getBytes("UTF-8");   // encode the String as UTF-8 bytes
     new String(bytes, "UTF-8");             // decode UTF-8 bytes back into a String

If the original string is not encoded as "UTF-8", can the data be damaged?

You might have a look at this: http://stackoverflow.com/questions/4016671/how-to-parse-a-string-that-is-in-a-different-encoding-from-java — Danyel, Feb 03 '13 at 11:04

score 2 · Accepted Answer · answered Feb 03 '13 at 11:12

You can't have a String object in Java properly encoded as anything other than UTF-16, as that's the sole encoding for those objects defined by the spec. Of course you could do something untoward like put Windows-1252 byte values in a char[] and create a String from it, but things will go wrong pretty much immediately.

What you can have is byte[] encoded in various different ways, and you can convert them to and from String using constructors which take a Charset, and with getBytes as in your code.

So you can do conversions using a String as an intermediate. I don't know of any way in the JDK to do a direct conversion, but the intermediate is likely not too costly in practice.
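As a rough illustration of the String-as-intermediate approach, here is a minimal sketch. The class and method names (`Transcode`, `transcode`) are made up for the example, and it assumes your JDK ships the windows-1255 charset, which full JDK builds normally do.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Transcode {
    // Hypothetical helper: re-encode bytes from one charset to another,
    // using a String as the intermediate representation.
    public static byte[] transcode(byte[] input, Charset from, Charset to) {
        String text = new String(input, from); // decode: bytes -> String
        return text.getBytes(to);              // encode: String -> bytes
    }

    public static void main(String[] args) {
        // Assumption: windows-1255 is available in this JDK.
        Charset win1255 = Charset.forName("windows-1255");

        // Hebrew text written with escapes so the source file's own encoding does not matter.
        String hebrew = "\u05E9\u05DC\u05D5\u05DD";
        byte[] legacy = hebrew.getBytes(win1255);
        byte[] utf8 = transcode(legacy, win1255, StandardCharsets.UTF_8);

        System.out.println(legacy.length + " bytes in windows-1255, " + utf8.length + " bytes in UTF-8");
        System.out.println(new String(utf8, StandardCharsets.UTF_8).equals(hebrew));
    }
}
```

Passing a `Charset` object rather than a charset name also avoids the checked `UnsupportedEncodingException` that the String-name overloads declare.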

About round-trip conversions: it is not generally true that you can convert between encodings without losing data. Only a few encodings can handle the full spectrum of Unicode characters (e.g. the UTF family, GB18030, etc.), while many legacy character sets encode only a small subset. You can't safely round trip through those character sets without losing data, unless you are sure the input falls into the representable set.
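If you want to check up front whether a given string falls into the representable set, something like the sketch below should work. `RoundTripCheck` and `isRepresentable` are names invented for this example; the check itself uses `CharsetEncoder.canEncode`.

```java
import java.nio.charset.Charset;

public class RoundTripCheck {
    // Hypothetical helper: true if every character of text can be encoded in cs.
    public static boolean isRepresentable(String text, Charset cs) {
        return cs.newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        // Assumption: windows-1255 is available in this JDK.
        Charset win1255 = Charset.forName("windows-1255");
        String hebrew = "\u05E9\u05DC\u05D5\u05DD"; // Hebrew letters, inside the windows-1255 repertoire
        String chinese = "\u4E2D\u6587";             // CJK ideographs, outside it

        System.out.println(isRepresentable(hebrew, win1255));
        System.out.println(isRepresentable(chinese, win1255));

        // getBytes does not throw for unmappable characters; it silently
        // substitutes the charset's replacement bytes, so the text is damaged.
        String damaged = new String(chinese.getBytes(win1255), win1255);
        System.out.println(damaged.equals(chinese));
    }
}
```

Checking first matters because `getBytes` gives no signal that a substitution happened.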

score 1 · Answer 2 · answered Feb 03 '13 at 14:11

String is attempting to be a sequence of abstract characters; it does not have any encoding from the point of view of its users. Of course, it must have an internal encoding, but that's an implementation detail.

It makes no sense to encode a String as UTF-8 and then decode the result back as UTF-8. It will be a no-op, in that:

new String(str.getBytes("UTF-8"), "UTF-8").equals(str); // true

But there are cases where the String abstraction falls apart and the above will be a "lossy" conversion. Because of the internal implementation details, a String can contain unpaired UTF-16 surrogates, which cannot be represented in UTF-8 (or any encoding for that matter, including the internal UTF-16 encoding*). They are lost in the encoding step: Java substitutes a replacement ('?') for each one, so when you decode back you get the original string with the unpaired surrogates replaced rather than preserved.
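To see the unpaired-surrogate case in action, a small sketch (the class and method names are invented for the example) compares a well-formed surrogate pair with a lone high surrogate:

```java
import java.nio.charset.StandardCharsets;

public class LoneSurrogate {
    // Hypothetical helper: encode to UTF-8 and decode straight back.
    public static String utf8RoundTrip(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // A supplementary character stored as a proper high+low surrogate pair.
        String wellFormed = "a" + new String(Character.toChars(0x1F600)) + "b";
        // A high surrogate with no low surrogate after it: not valid UTF-16.
        String broken = "a" + (char) 0xD800 + "b";

        System.out.println(utf8RoundTrip(wellFormed).equals(wellFormed));
        System.out.println(utf8RoundTrip(broken).equals(broken));
    }
}
```

The first comparison holds and the second does not, because the lone surrogate cannot be encoded and gets substituted.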

The only thing I can take from your question is that you have a String that resulted from interpreting binary data as Windows-1255 when it should have been interpreted as UTF-8. To fix this, you would have to go to the source of the data and decode it as UTF-8 explicitly.

If, however, you only have the string that resulted from the misinterpretation, you can't really do anything, because so many byte values have no representation in Windows-1255 and would not have made it into the string.

If this weren't the case, you could fully restore the original intended message by:

new String( str.getBytes("Windows-1255"), "UTF-8");
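A sketch of that repair, and of why it depends on the wrongly used charset, might look like this. `Mojibake` and `repair` are made-up names, and ISO-8859-1 is my own choice of contrast (it is not part of the question); I picked it because it assigns a character to every byte value.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Mojibake {
    // Hypothetical helper: undo a wrong decode by re-encoding with the charset
    // that was wrongly used, then decoding with the one that should have been used.
    public static String repair(String garbled, Charset wrongCharset, Charset actualCharset) {
        return new String(garbled.getBytes(wrongCharset), actualCharset);
    }

    public static void main(String[] args) {
        String original = "\u05E9\u05DC\u05D5\u05DD"; // Hebrew text, via escapes
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);

        // ISO-8859-1 maps all 256 byte values, so the wrong decode keeps every byte.
        String viaLatin1 = new String(utf8, StandardCharsets.ISO_8859_1);
        System.out.println(repair(viaLatin1, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8).equals(original));

        // windows-1255 leaves some byte values unassigned, so the same trick
        // is not guaranteed to restore the text. Assumption: the charset is available.
        Charset win1255 = Charset.forName("windows-1255");
        String via1255 = new String(utf8, win1255);
        System.out.println(repair(via1255, win1255, StandardCharsets.UTF_8).equals(original));
    }
}
```

I have deliberately not stated what the second line prints; it depends on how the charset treats the unassigned bytes, which is exactly the problem described above.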

* It is actually wrong of Java to allow unpaired surrogates to exist in its Strings in the first place, since that is not valid UTF-16.

Actually `String` does expose the fact that it is UTF-16 encoded to its end users through pretty much every method that deals with `char`s or `Character`s. Methods such as `charAt`, any method that takes an index or length, etc., all expose the fact that `String` code units are UTF-16. This is rather unfortunate, and is probably a consequence of UCS-2 being expanded to UTF-16 after this behavior in Java had already been formalized. If UCS-2 hadn't been superseded, the APIs would be clean and wouldn't expose surrogates, etc. — BeeOnRope, Feb 03 '13 at 22:20
@BeeOnRope Yes, but that will only be apparent with the rarely used supplementary planes. It still works normally with the BMP and with no unpaired surrogates (see footnote in the answer), which is the usual 99% situation. — Esailija, Feb 03 '13 at 22:34
Sure, but I assume you write code that treats the API as it actually is and covers the 100% case, rather than covering the 99% case and crossing your fingers that no non-BMP characters ever show up. Ignoring it is like saying you can ignore RTL text in UIs, daylight saving time, integer overflow, etc. because it doesn't occur more than 1% of the time. String fundamentally presents a UTF-16 API. Most of the time you could treat it as Unicode and get away with it, but I certainly wouldn't write code that way and I would never make the statement "it doesn't have any encoding from the point of view of its users". — BeeOnRope, Feb 03 '13 at 23:50
