
I have the following code:

String s0="Përshëndetje botë!";
byte[] b1=s0.getBytes("UTF8");
byte[] b2=s0.getBytes("ISO8859_1");
String s0_utf8=new String(b1, "UTF8");  //right encoding, wrong characters
//String s0_utf8=new String(b1, "ISO8859_1"); //wrong encoding, wrong characters
String s0_iso=new String(b2, "UTF8");  //wrong encoding; outputs right characters
//String s0_iso=new String(b2, "ISO-8859-1");  //right encoding; if uncommented, outputs damaged characters
System.out.println("s0_utf8: "+s0_utf8);  //
System.out.println("s0_iso: "+s0_iso);

So the string "Përshëndetje botë!" is converted into bytes using UTF-8 and ISO-8859-1, and those bytes are then converted back to Unicode characters using each of the two encodings. The right characters are displayed in only one case: when the original string is encoded into bytes using ISO8859_1 and decoded using UTF-8. All other cases result in wrong characters.

String s0="P\u00ebrsh\u00ebndetje bot\u00eb!";
byte[] b1=s0.getBytes("UTF8");
byte[] b2=s0.getBytes("ISO8859_1");
String s0_utf8=new String(b1, "UTF8"); //right encoding; outputs right characters
//String s0_utf8=new String(b1, "ISO8859_1"); //wrong encoding, wrong characters
String s0_iso=new String(b2, "UTF8");  //wrong encoding; outputs wrong characters
//String s0_iso=new String(b2, "ISO-8859-1");  //right encoding; if uncommented, outputs right characters
System.out.println("s0_utf8: "+s0_utf8);  //
System.out.println("s0_iso: "+s0_iso);

Here the right words are displayed in two cases: whenever the string is encoded and decoded using the same encoding.

I don't understand what's going on here. How is that possible? What difference does the Unicode-escape representation of the characters make? Why does the pair "encode with ISO, decode with UTF-8" work? Shouldn't the resulting string be completely different from the original, since the ISO bytes might be interpreted differently by UTF-8?

parsecer
  • [I don't get the behavior you describe at all.](http://ideone.com/1oNIwo) Encoded as utf8, decoded as utf8 prints the right string. Encoded as iso, decoded as utf8 doesn't. – Sotirios Delimanolis Feb 28 '17 at 17:42
  • @Sotirios Delimanolis Hm, I tried the online compiler here https://www.compilejava.net/, and it works just as explained... – parsecer Feb 28 '17 at 17:51

2 Answers

My guess is that the strings are wrong from the start, because your Java source file is encoded in encoding A, and the compiler reads it with encoding B. That explains why the problem doesn't happen when you use escape sequences rather than accents.
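A minimal sketch of that hypothesis, assuming the source file was saved as UTF-8 and the compiler read it as ISO-8859-1. The class name `MisreadLiteralDemo` and the helper `misread` are made up for illustration. It simulates the misreading at runtime, then repeats the encode/decode combinations from the question and prints booleans, so the result doesn't depend on the console's encoding:

```java
import java.nio.charset.StandardCharsets;

public class MisreadLiteralDemo {
    // Hypothetical helper: simulates a compiler that reads UTF-8 source bytes as ISO-8859-1.
    static String misread(String intended) {
        return new String(intended.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // Written with escapes so this literal is immune to the source file's encoding.
        String intended = "P\u00ebrsh\u00ebndetje bot\u00eb!";
        // Assumption: this is the string that actually ends up in the bytecode.
        String s0 = misread(intended);

        byte[] b1 = s0.getBytes(StandardCharsets.UTF_8);
        byte[] b2 = s0.getBytes(StandardCharsets.ISO_8859_1);

        String utf8ThenUtf8 = new String(b1, StandardCharsets.UTF_8);
        String isoThenIso = new String(b2, StandardCharsets.ISO_8859_1);
        String isoThenUtf8 = new String(b2, StandardCharsets.UTF_8);

        // Same encoding both ways just reproduces the already damaged string.
        System.out.println(utf8ThenUtf8.equals(s0));
        System.out.println(isoThenIso.equals(s0));
        // Encoding as ISO-8859-1 restores the original UTF-8 bytes of the file,
        // so decoding them as UTF-8 undoes the compiler's mistake.
        System.out.println(isoThenUtf8.equals(intended));
    }
}
```

Under that assumption all three lines print `true`, which would match the behavior described in the question: the "wrong" pair is the only one that shows the intended text.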

Regarding

//String s0_iso=new String(b2, "ISO-5589-1");  //right encoding; if uncommented, outputs damaged characters

no, it's not the right encoding. 5589 != 8859.

JB Nizet
  • That was a typo here; edited the post. Could you please explain what happens when the compiler comes into play? I thought the `.getBytes()` method makes the compiler's encoding irrelevant: whatever encoding is used during compilation, the bytes I get would correspond to what I passed to the method. The file is in UTF-8, the right encoding should be ISO-8859-1. But I wanted to try to deal with this through code alone, without changing the file's or the compiler's encoding. – parsecer Feb 28 '17 at 17:47
  • Your code says: take this string, and transform it to bytes. This happens at runtime. So the bytecode needs to contain the characters of the literal string contained in your code. But to put these characters into the bytecode, the compiler needs to read them from the Java source file. If the file is encoded in UTF8, but you tell the compiler that it's ISO-8859-1, the compiler will not decode the bytes of your Java source file correctly, and will thus put incorrect characters in the bytecode (just like if you read a UTF8 file with ISO88591). – JB Nizet Feb 28 '17 at 17:54
  • So at first we have UTF8 bytecode with wrong bytes for ISO characters. Then at runtime the JVM maps those bytes back to Unicode characters and encodes them again to UTF8 and ISO? How come the damaged symbols get displayed anyway, if everything is wrong with the encodings? – parsecer Feb 28 '17 at 18:22
  • In my case you were right: I assumed javac would interpret the source code (and thus my string constants in the source code) as UTF-8, but it didn't. To fix it, add the "-encoding utf8" switch to javac: `javac -encoding utf8 ...etc...`. – DisplayName May 07 '20 at 15:32
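
One possible way to check whether the compiler read a literal as intended is to print its code points instead of the characters themselves. This is only a sketch; `CodePointDump` and `dump` are hypothetical names, and the two inputs are written with escapes to show what a correctly read "ë" and a misread one would look like:

```java
public class CodePointDump {
    // Hypothetical helper: renders each code point of s as U+XXXX, space-separated.
    static String dump(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // The letter as intended: a single code point.
        System.out.println(dump("\u00eb"));        // U+00EB
        // The same letter after its UTF-8 bytes were read as ISO-8859-1: two code points.
        System.out.println(dump("\u00c3\u00ab"));  // U+00C3 U+00AB
    }
}
```

Passing the accented literal from the source file to `dump` and seeing `U+00C3 U+00AB` instead of `U+00EB` would suggest that the file and the compiler disagree about the encoding.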

This answer really helped me to understand what's going on.

In the first case:

String s0="Përshëndetje botë!";

s0 is in ISO8859_1;

b1: get bytes in UTF-8,

b2: get bytes in ISO8859_1.

IDEA converts the ë characters wrongly => Përshëndetje botë!

String s0_iso=new String(b2, "UTF8"); converts the string into the IDEA's encoding and it gets printed correctly.

String s0_iso=new String(b2, "ISO-8859-1"); doesn't change the original string => Përshëndetje botë!

When the string gets converted into foreign encoding (UTF-8), the trouble is coming:

String d=new String(b1, "UTF8"); => Përshëndetje botë!

String b=new String(b1, "ISO8859_1");=> Përshëndetje botë!

I'm still not entirely sure what's going on in these two cases but

d.equals("Përshëndetje botë!") is true.

My guess is that when the string is compiled in UTF-8 (the default), the compiler interprets the characters in s0 as if they were in UTF-8 already and no real conversion happens; the characters turn out damaged because there is nothing like this in UTF-8. During the construction of the d string, literally the same thing happens, but through the code itself, so the characters are handled as if they were already in UTF-8 and then pushed to a String in the same UTF-8. But they should have been decoded from ISO8859_1 first and only then encoded into UTF-8, and that's why the output turns out wrong.

In the second case:

String s0="P\u00ebrsh\u00ebndetje bot\u00eb!";

the original string is already fully in UTF-8. Therefore there will be fewer problems with displaying it.

String d = new String(b1, "UTF8") doesn't change the original string; d.equals(s0) is true => Përshëndetje botë!

String p =new String(b1, "ISO8859_1") reads the UTF-8 bytes of the original string as ISO8859_1 => PÃ«rshÃ«ndetje botÃ«!

p.equals("PÃ«rshÃ«ndetje botÃ«!") is true.

Not sure what's going on here though and why the last one gets all characters correctly:

String s0_iso=new String(b2, "UTF8") => P�rsh�ndetje bot�

String s0_iso=new String(b2, "ISO-8859-1") => Përshëndetje botë!
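
A rough sketch of why these two lines differ, looking at the raw bytes. `ByteDump` and `hex` are hypothetical names, and only the tail of the string is used to keep the output short:

```java
import java.nio.charset.StandardCharsets;

public class ByteDump {
    // Hypothetical helper: formats bytes as two-digit hex, space-separated.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String s = "bot\u00eb!";
        // UTF-8 needs two bytes for the accented letter, ISO-8859-1 only one.
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_8)));      // 62 6F 74 C3 AB 21
        System.out.println(hex(s.getBytes(StandardCharsets.ISO_8859_1))); // 62 6F 74 EB 21

        // In UTF-8, 0xEB announces a three-byte sequence, but no continuation bytes follow,
        // so the decoder substitutes the replacement character U+FFFD.
        String decoded = new String(s.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
        System.out.println(decoded.indexOf('\uFFFD') >= 0); // true
    }
}
```

Decoding the same single byte 0xEB as ISO-8859-1 maps it straight back to U+00EB, which is why the last line of the answer shows the correct characters.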

Community
parsecer