
I do not understand why this code does not output the same thing for both encodings. I thought Java automatically figured out the encoding of a string?

public static void main (String[] args) {
    try {
        displayStringAsHex("A B C \u03A9".getBytes("UTF-8"));
        System.out.println ("");
        displayStringAsHex("A B C \u03A9".getBytes("UTF-16"));
    } catch (UnsupportedEncodingException ex) {
        ex.printStackTrace();
    }
}

/** 
 * I got part of this from: http://rgagnon.com/javadetails/java-0596.html
 */
public static void displayStringAsHex(byte[] raw ) {
    String HEXES = "0123456789ABCDEF";
    System.out.println("raw = " + new String(raw));
    final StringBuilder hex = new StringBuilder( 2 * raw.length );
    for ( final byte b : raw ) {
      hex.append(HEXES.charAt((b & 0xF0) >> 4))
         .append(HEXES.charAt((b & 0x0F))).append(" ");
    }
    System.out.println ("hex.toString() = "+ hex.toString());
}

outputs:

(UTF-8)
hex.toString() = 41 20 42 20 43 20 CE A9 

(UTF 16)
hex.toString() = FE FF 00 41 00 20 00 42 00 20 00 43 00 20 03 A9

I cannot display the character output, but the UTF-8 version looks correct. The UTF-16 version has several squares and blocks.

Why don't they look the same?

Java Addict
  • Why would they output the same thing? UTF-8 and UTF-16 are two completely different encoding schemes. And this has nothing to do with "Java automatically figuring out the encoding". It's a matter of whether whatever you're using to display that encoded text can figure out the encoding or not. – JLRishe Apr 05 '14 at 04:39
  • Actually, they do look similar: the UTF-8 byte sequence 41 20 42 20 43 20 also appears in the UTF-16 output, with each byte preceded by 00, because UTF-16 uses two bytes per code unit where UTF-8 uses one byte for ASCII characters. The answers to this question may help: http://stackoverflow.com/questions/4655250/difference-between-utf-8-and-utf-16 – guilhebl Apr 05 '14 at 04:43

1 Answer


Java does not automatically figure out the encoding of a string.

The String(byte[]) constructor "constructs a new String by decoding the specified array of bytes using the platform's default charset."

In your case the UTF-16 bytes are being decoded with the platform's default charset (apparently UTF-8), so you end up with garbage. Use new String(raw, Charset.forName("UTF-16")) to rebuild the String.
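
As a minimal sketch of that fix, the example below encodes the string from the question in both charsets and decodes each byte array with the same charset that produced it. The class name CharsetRoundTrip and the helpers decode and toHex are made up for illustration, and StandardCharsets (Java 7 and later) is used here in place of Charset.forName; treat it as one possible approach rather than the only way to do it.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical class name, used only for this illustration.
public class CharsetRoundTrip {

    // Decode with an explicitly named charset instead of the platform default.
    static String decode(byte[] raw, Charset charset) {
        return new String(raw, charset);
    }

    // Format the bytes as space-separated, two-digit uppercase hex values.
    static String toHex(byte[] raw) {
        StringBuilder hex = new StringBuilder(3 * raw.length);
        for (byte b : raw) {
            // Mask to 0..255 so negative bytes do not print as FFFFFFxx.
            hex.append(String.format("%02X ", b & 0xFF));
        }
        return hex.toString().trim();
    }

    public static void main(String[] args) {
        String text = "A B C \u03A9";
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        byte[] utf16 = text.getBytes(StandardCharsets.UTF_16);

        // The hex dumps differ because the two encodings differ.
        System.out.println("UTF-8  bytes: " + toHex(utf8));
        System.out.println("UTF-16 bytes: " + toHex(utf16));

        // Decoding with the matching charset recovers the original string.
        System.out.println(decode(utf8, StandardCharsets.UTF_8).equals(text));
        System.out.println(decode(utf16, StandardCharsets.UTF_16).equals(text));
    }
}
```

Both comparisons should print true. The hex dumps still differ, as in the question, because the bytes depend on the encoding even though the decoded strings are equal.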

user695022