public class UTF8 {
    public static void main(String[] args) {
        String s = "ヨ"; // U+FF6E
        System.out.println(s.getBytes().length); // number of bytes in the encoded string
        System.out.println(s.charAt(0));         // first character in the string
    }
}

output:

3
ヨ

Please help me understand this. I'm trying to understand how UTF-8 encoding works in Java. According to the Java docs, the definition of char is: "The char data type is a single 16-bit Unicode character."

Does this mean the char type in Java can only support those Unicode characters that can be represented with 2 bytes, and not more than that?

In the above program, the number of bytes allocated for that string is 3, yet the third line, which returns the first character (2 bytes in Java), can hold a character that is 3 bytes long? I'm really confused here.

Any good references regarding this concept in Java, or in general, would be really appreciated.

– akd

4 Answers


Nothing in your code example is directly using UTF-8. Java strings are encoded in memory using UTF-16 instead. Unicode codepoints that do not fit in a single 16-bit char will be encoded using a 2-char pair known as a surrogate pair.
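
For example, a minimal sketch of a surrogate pair (U+1F600 is just an arbitrary supplementary character chosen for illustration, not something from the question):

public class SurrogatePairDemo {
    public static void main(String[] args) {
        // U+1F600 does not fit in a single 16-bit char, so Java stores it as a surrogate pair
        String s = "\uD83D\uDE00";
        System.out.println(s.length());                       // 2 -> two char units
        System.out.println(s.codePointCount(0, s.length()));  // 1 -> one Unicode codepoint
        System.out.printf("%04X %04X%n", (int) s.charAt(0), (int) s.charAt(1)); // D83D DE00
    }
}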

If you do not pass a parameter value to String.getBytes(), it returns a byte array that has the String contents encoded using the underlying OS's default charset. If you want to ensure a UTF-8 encoded array then you need to use getBytes("UTF-8") instead.

Calling String.charAt() only ever returns a raw UTF-16 encoded char from the String's in-memory storage; no charset conversion is involved.

So in your example, the Unicode character is stored in the String's in-memory storage using two bytes that are UTF-16 encoded (0x6E 0xFF or 0xFF 0x6E depending on endianness), but is stored in the byte array from getBytes() using three bytes that are encoded using whatever the OS default charset happens to be.

In UTF-8, that particular Unicode character happens to use 3 bytes as well (0xEF 0xBD 0xAE).
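
A small sketch to see both representations side by side (it reuses the "ヨ" string from the question; StandardCharsets requires Java 7+, and the & 0xFF mask is only there to print the signed bytes as unsigned hex):

import java.nio.charset.StandardCharsets;

public class Utf16VsUtf8 {
    public static void main(String[] args) {
        String s = "ヨ"; // U+FF6E
        // the single UTF-16 char held in the String's in-memory storage
        System.out.printf("charAt(0) = U+%04X%n", (int) s.charAt(0));   // U+FF6E
        // the bytes produced when the String is explicitly encoded as UTF-8
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            System.out.printf("%02X ", b & 0xFF);                       // EF BD AE
        }
        System.out.println();
    }
}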

– Remy Lebeau

String.getBytes() returns the bytes using the platform's default character encoding, which does not necessarily match the internal representation.

You're best off never using this method in most cases, because it rarely makes sense to rely on the platform's default encoding. Use String.getBytes(String charsetName) instead and explicitly specify the character set that should be used to encode your String into bytes.
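
For instance, a sketch of the difference (on Java 7+ the Charset-based overload with StandardCharsets.UTF_8 also avoids the checked UnsupportedEncodingException that the String-based overload throws):

import java.io.UnsupportedEncodingException;
import java.nio.charset.StandardCharsets;

public class ExplicitCharset {
    public static void main(String[] args) throws UnsupportedEncodingException {
        String s = "ヨ";
        byte[] platformDefault = s.getBytes();                        // whatever the platform default charset is
        byte[] utf8ByName = s.getBytes("UTF-8");                      // explicit, throws a checked exception
        byte[] utf8ByCharset = s.getBytes(StandardCharsets.UTF_8);    // explicit, no checked exception (Java 7+)
        System.out.println(platformDefault.length + " / " + utf8ByName.length + " / " + utf8ByCharset.length);
    }
}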

– Fabian Barney

UTF-8 is a variable-length encoding that uses only one byte for ASCII characters (values between 0 and 127), and two, three (or even more) bytes for other Unicode symbols.

This is because the high bit of the byte is used to signal "this is a multi-byte sequence", so one bit out of every 8 is not used to represent the actual data (the character code) but to mark the byte.

So, even though Java uses 2 bytes in RAM for each char, when characters are "serialized" using UTF-8 they may produce one, two or three bytes in the resulting byte array; that's how the UTF-8 encoding works.
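
As a rough illustration (the characters below are just arbitrary examples of a 1-, 2- and 3-byte UTF-8 sequence, and the source file is assumed to be saved as UTF-8 so the literals survive compilation):

import java.nio.charset.StandardCharsets;

public class Utf8Lengths {
    public static void main(String[] args) {
        // "A" is U+0041 (ASCII), "é" is U+00E9, "ヨ" is U+FF6E
        for (String s : new String[] { "A", "é", "ヨ" }) {
            int utf8Bytes = s.getBytes(StandardCharsets.UTF_8).length;
            System.out.println(s + " -> " + utf8Bytes + " UTF-8 byte(s), " + s.length() + " char(s) in memory");
        }
    }
}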

– Simone Gianni
  • UTF-8 uses a maximum of 2 bytes – adosaiguas Aug 29 '12 at 23:09
  • UTF-8 uses a maximum of 4 bytes, not 2 bytes (6 bytes if you consider older UTF-8 specs before UTF-8 was modified to not exceed the codepoints that UTF-16 supports). – Remy Lebeau Aug 29 '12 at 23:11
  • @adosaiguas "UTF-8 encodes each of the 1,112,064[7] code points in the Unicode character set using one to four 8-bit bytes" (wikipedia) – Simone Gianni Aug 29 '12 at 23:11
  • @RemyLebeau You are both right, sorry, I always thought it was UTF-8 max 2 bytes and UTF-16 max 4 bytes. – adosaiguas Aug 29 '12 at 23:22
  • A question regarding the third statement, "despite Java using 2 bytes in ram for each char": does it mean Java uses 16 bits to represent the 1,112,064 code points of Unicode? Isn't 2^16 less than the number of code points? Is this a valid question at all? – akd Aug 29 '12 at 23:56
  • Java natively supports Unicode chars from 0x0000 to 0xFFFF. This has nothing to do with UTF-8, which supports more, and means that Java cannot (natively, though it has since been expanded) read any and every UTF-8 text file into a String. To support chars above Unicode 0xFFFF, JSR 204 has been written, see here: http://java.sun.com/developer/technicalArticles/Intl/Supplementary/ – Simone Gianni Aug 30 '12 at 12:40
  • @adosaiguas Please delete your comment which has false information. – Koray Tugay Jan 28 '16 at 16:14
  • @KorayTugay It's already clear from the other comments that my initial comment was not correct, but if I remove it, those comments won't make any sense, so that's why it is still there. I don't think it makes sense to remove a 2-year-old comment... Should we delete all comments in this answer then? – adosaiguas Feb 01 '16 at 16:13
  • @KorayTugay We are adding useless comments to this answer. IMHO, deleting or editing it makes some of the other comments lose their sense, so it makes no sense to do what you request. – adosaiguas Feb 03 '16 at 19:20

This is how Java represents characters.

– adosaiguas