I have the following code:
String s0 = "Përshëndetje botë!";
byte[] b1 = s0.getBytes("UTF8");
byte[] b2 = s0.getBytes("ISO8859_1");
String s0_utf8 = new String(b1, "UTF8"); // right encoding, wrong characters
//String s0_utf8 = new String(b1, "ISO8859_1"); // wrong encoding, wrong characters
String s0_iso = new String(b2, "UTF8"); // wrong encoding; outputs right characters
//String s0_iso = new String(b2, "ISO8859_1"); // right encoding; if uncommented, outputs damaged characters
System.out.println("s0_utf8: " + s0_utf8);
System.out.println("s0_iso: " + s0_iso);
So, the string "Përshëndetje botë!" is converted into bytes using UTF-8 and ISO-8859-1, and those bytes are then decoded back into characters using the corresponding encodings. The right characters are displayed in only one case here: when the original string is encoded into bytes using ISO8859_1 and decoded using UTF-8. All other cases result in wrong characters.
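The same behavior can be reproduced deterministically (a minimal sketch; the class and variable names are mine) by building a string whose characters are the UTF-8 bytes of the text misread as ISO-8859-1, i.e. a string that is already mojibake before any of the conversions above run:

```java
import java.nio.charset.StandardCharsets;

public class MojibakeRoundTrip {
    public static void main(String[] args) throws Exception {
        String correct = "P\u00ebrsh\u00ebndetje bot\u00eb!";

        // Build a mojibake string: take the UTF-8 bytes of the correct text
        // and misread them as ISO-8859-1 (what happens when the source file
        // is saved as UTF-8 but compiled with a Latin-1 default encoding).
        byte[] utf8Bytes = correct.getBytes(StandardCharsets.UTF_8);
        String mojibake = new String(utf8Bytes, StandardCharsets.ISO_8859_1);

        // ISO-8859-1 maps each char U+0000..U+00FF back to exactly one byte,
        // so encoding the mojibake string recovers the original UTF-8 bytes...
        byte[] recovered = mojibake.getBytes(StandardCharsets.ISO_8859_1);

        // ...and decoding those bytes as UTF-8 yields the correct text again.
        String repaired = new String(recovered, StandardCharsets.UTF_8);
        System.out.println(repaired.equals(correct)); // prints true
    }
}
```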
String s0 = "P\u00ebrsh\u00ebndetje bot\u00eb!";
byte[] b1 = s0.getBytes("UTF8");
byte[] b2 = s0.getBytes("ISO8859_1");
String s0_utf8 = new String(b1, "UTF8"); // right encoding; outputs right characters
//String s0_utf8 = new String(b1, "ISO8859_1"); // wrong encoding, wrong characters
String s0_iso = new String(b2, "UTF8"); // wrong encoding; outputs wrong characters
//String s0_iso = new String(b2, "ISO8859_1"); // right encoding; if uncommented, outputs right characters
System.out.println("s0_utf8: " + s0_utf8);
System.out.println("s0_iso: " + s0_iso);
Here the right words are displayed in two cases: whenever the string is both encoded and decoded using the same encoding.
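The matching-encoding round trips can be checked directly (a minimal sketch; the class and variable names are mine). All the characters of this string exist in both charsets, so both round trips are lossless, even though the two byte representations differ in length:

```java
import java.nio.charset.StandardCharsets;

public class RoundTrip {
    public static void main(String[] args) {
        String s0 = "P\u00ebrsh\u00ebndetje bot\u00eb!";

        // Encoding and decoding with the same charset is lossless here,
        // because every character of s0 is representable in both charsets.
        String viaUtf8 = new String(s0.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8);
        String viaIso  = new String(s0.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.ISO_8859_1);
        System.out.println(viaUtf8.equals(s0)); // prints true
        System.out.println(viaIso.equals(s0));  // prints true

        // The byte representations still differ: UTF-8 needs two bytes
        // for each of the three ë (0xC3 0xAB), ISO-8859-1 needs one (0xEB).
        System.out.println(s0.getBytes(StandardCharsets.UTF_8).length);      // prints 21
        System.out.println(s0.getBytes(StandardCharsets.ISO_8859_1).length); // prints 18
    }
}
```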
I don't understand what's going on here. How is this possible? What difference does Unicode's representation of the characters make? Why does the pair "encode with ISO-8859-1, decode with UTF-8" work? Shouldn't the resulting string be completely different from the original, since the ISO-8859-1 bytes might be interpreted differently by UTF-8?
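For reference, here is how the raw bytes look for a shortened sample ("botë"; hex is a hypothetical helper of mine). When the string in memory really is correct, decoding its ISO-8859-1 bytes as UTF-8 does fail, exactly as the intuition above suggests:

```java
import java.nio.charset.StandardCharsets;

public class ByteDump {
    // Hypothetical helper: render a byte array as space-separated hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) sb.append(String.format("%02X ", b));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String s = "bot\u00eb";
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_8)));      // prints 62 6F 74 C3 AB
        System.out.println(hex(s.getBytes(StandardCharsets.ISO_8859_1))); // prints 62 6F 74 EB

        // Decoding the ISO-8859-1 bytes as UTF-8: 0xEB announces a three-byte
        // sequence that never completes, so the decoder substitutes U+FFFD.
        String broken = new String(s.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
        System.out.println(broken.equals("bot\uFFFD")); // prints true
    }
}
```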