
So a 'char' in Java is 2 bytes. (This can be verified in the Java Language Specification.)

I have this sample code:

public class FooBar {
    public static void main(String[] args) {
        String foo = "€";
        System.out.println(foo.getBytes().length);
        final char[] chars = foo.toCharArray();
        System.out.println(chars[0]);
    }
}

And the output is as follows:

3
€

My question is: how did Java fit a 3-byte character into the char data type? BTW, I am running the application with the parameter -Dfile.encoding=UTF-8.

Also, if I edit the code a little further and add the following statements:

// requires: import java.io.DataOutputStream;
// requires: import java.io.File;
// requires: import java.io.FileOutputStream;
File baz = new File("baz.txt");
final DataOutputStream dataOutputStream = new DataOutputStream(new FileOutputStream(baz));
dataOutputStream.writeChar(chars[0]); // writeChar writes the 16-bit char value, high byte first
dataOutputStream.flush();
dataOutputStream.close();

then the resulting file "baz.txt" is only 2 bytes, and it does not show the correct character even if I treat it as a UTF-8 file.

Edit 2: If I open the file "baz.txt" with the encoding UTF-16BE, I see the € character just fine in my text editor, which makes sense I guess.
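
For reference, a minimal sketch that reads the file back and decodes it as UTF-16BE (it assumes "baz.txt" was produced by the writeChar code above):

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

byte[] bytes = Files.readAllBytes(Paths.get("baz.txt"));
System.out.println(bytes.length);                                 // 2
System.out.println(new String(bytes, StandardCharsets.UTF_16BE)); // €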

Koray Tugay
  • Java uses UTF-16 internally. See http://stackoverflow.com/questions/9699071/what-is-the-javas-internal-represention-for-string-modified-utf-8-utf-16 – Thomas Stets Jan 21 '16 at 11:24
  • Char is not a character; it's less, which is one of the biggest problems with Java. See utf8everywhere.org for a complete explanation of how it all works. – Pavel Radzivilovsky Jan 22 '16 at 02:46

2 Answers


String.getBytes() returns the bytes using the platform's default character encoding, which does not necessarily match the internal representation.

Java uses 2 bytes in RAM for each char. When chars are "serialized" using UTF-8, they may produce one, two, or three bytes in the resulting byte array; that is how the UTF-8 encoding works.

Your code example produces UTF-8 output, but Java strings are encoded in memory using UTF-16. Unicode code points that do not fit in a single 16-bit char are encoded using a two-char pair known as a surrogate pair.
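
For example, a minimal sketch (U+1F600, an emoji outside the Basic Multilingual Plane, is an arbitrary choice):

public class SurrogateDemo {
    public static void main(String[] args) {
        String s = new String(Character.toChars(0x1F600));           // "grinning face" emoji
        System.out.println(s.length());                              // 2 -> two UTF-16 code units (a surrogate pair)
        System.out.println(s.codePointCount(0, s.length()));         // 1 -> but only one code point
        System.out.println(Character.isHighSurrogate(s.charAt(0)));  // true
    }
}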

If you do not pass an argument to String.getBytes(), it returns a byte array containing the String's contents encoded in the underlying OS's default charset. If you want to guarantee a UTF-8 encoded array, use getBytes("UTF-8") instead.
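
To illustrate, a minimal sketch (StandardCharsets.UTF_8, available since Java 7, avoids the checked exception that the string-named getBytes("UTF-8") overload throws):

import java.nio.charset.StandardCharsets;

String foo = "€";
System.out.println(foo.getBytes().length);                        // depends on the default charset
System.out.println(foo.getBytes(StandardCharsets.UTF_8).length);  // always 3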

Calling String.charAt() returns a single UTF-16 code unit straight from the String's in-memory storage; no re-encoding takes place.

Check this link: java utf8 encoding - char, string types

Shiladittya Chakraborty

Java uses UTF-16 (16 bits) for the in-memory representation.

The Euro symbol fits into that, even though it needs three bytes in UTF-8.
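
A quick sketch of that point (the numeric value of the char shows it fits in 16 bits):

import java.nio.charset.StandardCharsets;

char euro = '\u20AC'; // the euro sign, code point U+20AC
System.out.println((int) euro);                                   // 8364, well within 16 bits
System.out.println("€".getBytes(StandardCharsets.UTF_8).length);  // 3 bytes in UTF-8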

Thilo
  • Yes, and that is a bit of a problem, because Unicode is bigger than that. Some Unicode codepoints require two chars in Java now. So the result of `length` or `charAt` may not be entirely satisfactory if you use the "whole catalogue". – Thilo Jan 21 '16 at 11:31
  • So the parameter I pass -Dfile.encoding=UTF-8 does not really change much, can we say? – Koray Tugay Jan 21 '16 at 11:32
  • That parameter defines the default encoding, i.e. what you get by calling `getBytes()` without specifying a character set (don't do that; always declare the character encoding). – Thilo Jan 21 '16 at 11:34
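
To make that comment concrete, a minimal sketch (the first line's output depends on the platform and on -Dfile.encoding, at least on JDKs of this era):

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

System.out.println(Charset.defaultCharset());                     // e.g. UTF-8 when run with -Dfile.encoding=UTF-8
System.out.println("€".getBytes().length);                        // uses the default charset
System.out.println("€".getBytes(StandardCharsets.UTF_8).length);  // always 3, regardless of defaults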