
I found the following code in SO. Does this really work?

String xml = new String("áéíóúñ");
byte[] latin1 = xml.getBytes("UTF-8");
byte[] utf8 = new String(latin1, "ISO-8859-1").getBytes("UTF-8");

I mean, latin1 is UTF-8-encoded in the second line, but read as ISO-8859-1-encoded in the third? Can this ever work?

Not that I want to criticize the cited code; I am just confused because I ran into some very similar legacy code that seems to work, and I cannot explain why.

EDIT: I guess that in the original post, "UTF-8" in line 2 was just a typo. But I am not sure ...

EDIT2: After my initial posting, someone edited the code above and changed the second line to byte[] latin1 = xml.getBytes("ISO-8859-1");. I don't know who did that or why, but it clearly caused a lot of confusion. Sorry to all who saw the wrong version of the code. The code above is correct now.
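For reference, here is a small self-contained sketch (my own, not from the cited answer) of what the snippet actually does, using the Charset overload of getBytes to avoid the checked exception:

```java
import java.nio.charset.StandardCharsets;

public class RoundTrip {
    public static void main(String[] args) {
        String xml = "áéíóúñ";
        // Line 2 as posted: encode the UTF-16 string to UTF-8 bytes
        byte[] latin1 = xml.getBytes(StandardCharsets.UTF_8);
        // Line 3: misinterpret those UTF-8 bytes as ISO-8859-1, then re-encode
        String garbled = new String(latin1, StandardCharsets.ISO_8859_1);
        byte[] utf8 = garbled.getBytes(StandardCharsets.UTF_8);

        System.out.println(latin1.length); // 12: two UTF-8 bytes per character
        System.out.println(garbled);       // mojibake, not the original string
        System.out.println(utf8.length);   // 24: the mojibake re-encoded
    }
}
```

Run as posted, the decoded string is not equal to the original, so the snippet does not round-trip correctly.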

gefei
  • You're doing this all wrong. Do not decode. Do not getBytes. Just compile with `javac -encoding UTF-8` or whatever your true encoding is. Java has tolerable Unicode support, but the defaults work against you. – tchrist Feb 17 '12 at 15:38
  • Your intuition is correct; line 2 is a typo or bug. The code transcodes a UTF-16 string to UTF-8, then pretends the data is ISO-8859-1 and transcodes it back to UTF-16 garbage. Then the corrupted string is transcoded to UTF-8, resulting in more garbage. – McDowell Feb 17 '12 at 15:42

2 Answers


getBytes(Charset charset) results in a byte array encoded using the charset, so latin1 is UTF-8 encoded.

Put System.out.println(latin1.length); as the third line and it will tell you that the byte array length is 12. This means that it really is UTF-8 encoded: six characters, two bytes each.

new String(latin1, "ISO-8859-1") is incorrect because latin1 is UTF-8 encoded and you're telling it to parse the bytes as ISO-8859-1. That's why it produces a String of 12 garbage characters: Ã¡Ã©Ã­Ã³ÃºÃ±.

When you then get bytes from Ã¡Ã©Ã­Ã³ÃºÃ± using UTF-8 encoding, the result is a 24-byte array, since each of those 12 mojibake characters is itself non-ASCII and takes two bytes in UTF-8.
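The byte counts can be checked with a short sketch (my own, not part of the original answer):

```java
import java.nio.charset.StandardCharsets;

public class ByteCounts {
    public static void main(String[] args) {
        byte[] utf8Bytes = "áéíóúñ".getBytes(StandardCharsets.UTF_8);
        System.out.println(utf8Bytes.length); // 12: 6 characters, 2 bytes each

        // Misread as ISO-8859-1: every single byte becomes one (wrong) character
        String garbage = new String(utf8Bytes, StandardCharsets.ISO_8859_1);
        System.out.println(garbage.length()); // 12 characters of mojibake

        // Each mojibake character is non-ASCII, so UTF-8 needs 2 bytes for it
        System.out.println(garbage.getBytes(StandardCharsets.UTF_8).length); // 24
    }
}
```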

I hope everything is clear now.

Oleg Mikheev

Those characters are present in both character encodings. It's just that UTF-8 and ISO-8859-1 each use different byte representations of the characters beyond the ASCII range.

If you used a character which is present in UTF-8 but not in ISO-8859-1, then it would of course fail.
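To illustrate that failure mode, here is a hypothetical example (my own sketch) with '€' (U+20AC), which has no ISO-8859-1 code point; getBytes(Charset) replaces unmappable characters with the charset's default replacement byte, which for ISO-8859-1 is '?':

```java
import java.nio.charset.StandardCharsets;

public class LossyDemo {
    public static void main(String[] args) {
        // '€' (U+20AC) exists in Unicode/UTF-8 but not in ISO-8859-1
        String s = "€";
        byte[] b = s.getBytes(StandardCharsets.ISO_8859_1);
        // The unmappable character was replaced, so the original is lost
        String back = new String(b, StandardCharsets.ISO_8859_1);
        System.out.println(back); // "?" — not "€"
    }
}
```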

BalusC