Why getBytes() encoding conversion gives these results

Question

I have a String in UTF-8, which I first convert to ISO-8859-1, then convert back to UTF-8, and finally get ISO8859_1 bytes from. The result is supposed to be ISO-8859-1 again, but instead I get UTF-8 bytes. Why?

  import java.io.UnsupportedEncodingException;

  public class Test {
      public static void main(String[] args) throws UnsupportedEncodingException {
          String s0 = "H\u00ebllo";
          byte[] bytes = s0.getBytes("ISO8859_1");
          byte[] bytes1 = s0.getBytes("UTF-8");
          printBytes(bytes, "bytes");    // 72 -21 108 108 111  (ISO-8859-1)
          printBytes(bytes1, "bytes1");  // 72 -61 -85 108 108 111  (UTF-8)
          byte[] bytes2 = new String(s0.getBytes("UTF-8"), "ISO8859_1").getBytes("ISO8859_1");
          printBytes(bytes2, "bytes2");  // 72 -61 -85 108 108 111  (UTF-8)
      }

      private static void printBytes(byte[] array, String name) {
          System.out.print(name + ": ");
          for (int i = 0; i < array.length; i++) {
              System.out.print(array[i] + " ");
          }
          System.out.println();
      }
  }

Have you tried using java.nio.charset.StandardCharsets instead of the string names (StandardCharsets.UTF_8 instead of "UTF-8", and so on)? See the doc: http://docs.oracle.com/javase/7/docs/api/java/nio/charset/StandardCharsets.html – Adonis Mar 04 '17 at 00:57

score 1 · Accepted Answer · answered Mar 04 '17 at 01:01

This makes no sense:

new String(s0.getBytes("UTF-8"), "ISO8859_1")

You are interpreting a UTF-8 byte[] with ISO8859_1 encoding. You should interpret UTF-8 bytes with UTF-8 encoding:

new String(s0.getBytes("UTF-8"), "UTF-8")

Then it will print:

bytes: 72 -21 108 108 111 
bytes1: 72 -61 -85 108 108 111 
bytes2: 72 -21 108 108 111
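
As a minimal sketch of the difference, the snippet below decodes the same UTF-8 bytes twice, once with the mismatched charset and once with the matching one, and then encodes each result as ISO-8859-1. The class name `DecodeDemo` and the helper `reencode` are made up for illustration, and `StandardCharsets` constants are used in place of the charset name strings so that no checked exception is involved.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class DecodeDemo {
    // Hypothetical helper: take the UTF-8 bytes of s, decode them with
    // decodeWith, then encode the resulting String as ISO-8859-1.
    static byte[] reencode(String s, Charset decodeWith) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        String decoded = new String(utf8, decodeWith);
        return decoded.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String s0 = "H\u00ebllo";

        // Mismatched decode: each UTF-8 byte becomes one character,
        // so encoding back as ISO-8859-1 returns the UTF-8 bytes unchanged.
        System.out.println(Arrays.toString(reencode(s0, StandardCharsets.ISO_8859_1)));
        // [72, -61, -85, 108, 108, 111]

        // Matching decode: the original String is restored,
        // so encoding as ISO-8859-1 gives the single byte -21 for the e-diaeresis.
        System.out.println(Arrays.toString(reencode(s0, StandardCharsets.UTF_8)));
        // [72, -21, 108, 108, 111]
    }
}
```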

You also say:

I have a String in UTF-8

Strings don't have a well-defined internal encoding that you can rely on; that is an implementation detail. Once a String has been created, it carries no encoding: you just have a String. You can, however, get a byte[] from it in a specific encoding.
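
A small sketch of that point, assuming the same "Hëllo" text as in the question: Strings decoded from different byte encodings compare equal, and the encoding only reappears when bytes are requested. The class and method names (`StringHasNoCharset`, `fromIso`, `fromUtf8`) are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class StringHasNoCharset {
    // Hypothetical helper: "Hëllo" decoded from its ISO-8859-1 bytes.
    static String fromIso() {
        return new String(new byte[]{72, -21, 108, 108, 111}, StandardCharsets.ISO_8859_1);
    }

    // Hypothetical helper: "Hëllo" decoded from its UTF-8 bytes.
    static String fromUtf8() {
        return new String(new byte[]{72, -61, -85, 108, 108, 111}, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String literal = "H\u00ebllo";

        // All three are the same String, regardless of the source bytes.
        System.out.println(literal.equals(fromIso()));   // true
        System.out.println(literal.equals(fromUtf8()));  // true

        // The charset only matters when converting to bytes.
        System.out.println(Arrays.toString(fromIso().getBytes(StandardCharsets.UTF_8)));
        // [72, -61, -85, 108, 108, 111]
        System.out.println(Arrays.toString(fromUtf8().getBytes(StandardCharsets.ISO_8859_1)));
        // [72, -21, 108, 108, 111]
    }
}
```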

Jorn Vernee

Thank you for your answer. It makes sense now. – parsecer Mar 04 '17 at 01:09
The idea of the String encoding came from this answer http://stackoverflow.com/a/38913688/4759176 - the String in ISO obviously isn't displayed properly, while the String in UTF-8 is fine, so it looks like there are encodings... – parsecer Mar 04 '17 at 01:10
Now that I looked, in that answer the same approach is used - the bytes are taken in encoding `X` and the second argument to the String constructor is an encoding `Y`... – parsecer Mar 04 '17 at 01:16
@parsecer That's the encoding of the `byte[]`. The `String` uses a different encoding internally. In that case the `String` contains some characters that cannot be represented in the ISO encoding, so they don't display properly. The point I'm trying to make is that there is a given level of abstraction. Sure, `String`s have an internal encoding, but which one is not made public through its interface, and it shouldn't matter in its use. – Jorn Vernee Mar 04 '17 at 01:18
@parsecer About that answer. All I can say is that it looks wrong to me (and by my best logic _is_ wrong). [The Javadoc](https://docs.oracle.com/javase/8/docs/api/java/lang/String.html#String-byte:A-java.lang.String-) also seems to agree with me saying that that constructor: _"Constructs a new String by decoding the specified array of bytes using the specified charset."_. Besides, you saw yourself that it didn't work right? – Jorn Vernee Mar 04 '17 at 01:20
If the String encoding is not that important, then why does using `String new = "Hëllo"` and `printBytes(new.getBytes("ISO8859_1"))` result in printing the same `UTF-8` bytes, rather than `ISO`'s `-21`? Here the `ë` is the representation of `\u00eb`. – parsecer Mar 04 '17 at 01:50
Figured that out. Java's storing of the `new` string is at fault: it confuses which encoding this String's bytes are initially in. So it's not the method's fault. – parsecer Mar 04 '17 at 03:56
And that's the reason that answer of webmaster works: Java stores `UTF-8` under the `ISO`, so when you request `ISO` it really returns the `UTF-8` representation (in case the initial string is character gibberish). – parsecer Mar 04 '17 at 03:59
@parsecer You have in your example a `UTF-8` byte array. You then create a `String` from that using the `ISO` encoder. Then you translate it back (with `getBytes`) using the same encoder, which basically does the reverse, so you get a `UTF-8` byte array back. That's also why the other answer works; if you interpret `UTF-8` with `ISO` you see gibberish, but if you then apply the reverse operation to that gibberish, you get `UTF-8` back. – Jorn Vernee Mar 04 '17 at 11:54
`new String(s0.getBytes("UTF-8"), "UTF-8")` yields the same `String` value as `s0`, so going through `getBytes()` is redundant when you can just use `s0` as-is. – Remy Lebeau Mar 08 '17 at 22:57
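
The last two comments can be illustrated with a short sketch. It assumes Java's ISO-8859-1 charset maps each of the 256 byte values to exactly one character and back, which is why decoding arbitrary bytes with it and then encoding with it again is lossless. The class name `RoundTripDemo` and the helper `isoRoundTrip` are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class RoundTripDemo {
    // Hypothetical helper: decode bytes as ISO-8859-1, then encode the
    // resulting String back to ISO-8859-1.
    static byte[] isoRoundTrip(byte[] input) {
        String decoded = new String(input, StandardCharsets.ISO_8859_1);
        return decoded.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // Every possible byte value survives the ISO-8859-1 round trip.
        byte[] all = new byte[256];
        for (int i = 0; i < all.length; i++) {
            all[i] = (byte) i;
        }
        System.out.println(Arrays.equals(all, isoRoundTrip(all)));  // true

        // So UTF-8 bytes pushed through the same round trip come back as UTF-8 bytes.
        String s0 = "H\u00ebllo";
        byte[] utf8 = s0.getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.equals(utf8, isoRoundTrip(utf8)));  // true

        // Encoding and decoding with the same charset restores s0,
        // which is why that step is redundant.
        String viaUtf8 = new String(utf8, StandardCharsets.UTF_8);
        System.out.println(viaUtf8.equals(s0));  // true
    }
}
```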
