
Please help me understand how Java stores strings and char arrays. In Java, Character.SIZE returns 16, and most answers on Stack Overflow and the web state that a character in Java is 16 bits (obviously, since it uses UTF-16 internally). However, UTF-16 can't fit everything in 2 bytes, for example Chinese characters.

char c = '的';
System.out.println(Arrays.toString(Character.toString(c).getBytes(StandardCharsets.UTF_16)));

This piece of code prints [-2, -1, 118, -124], meaning the char was 4 bytes long. Does that mean that Strings in Java, which consist of a char[] array, take 4 bytes for every char? That would take too much space, so I assume that's not what happens. It must be that a char has variable length. If so, it's impossible to store a char[] as a long list of bytes in RAM without first specifying the length of each individual char, and that would also take too much space.

So what's the actual size of a char in Java? And how is it stored in RAM if it has variable length?

maklas
  • The actual size of a `char` is 16 bits. Some characters are represented by more than one `char`. That is, `char` != real-life character. – RealSkeptic Aug 21 '19 at 10:02
  • Why do you think that `的` doesn't fit in UTF-16? [Its UTF-16 representation is `0x7684`](https://ideone.com/lH4vRN), which is less than half the max value that can be represented by a char. – Michael Aug 21 '19 at 10:04
  • @Michael, why does my piece of code print an array of 4 bytes then? Now it's even more confusing... Maybe I provided a bad character for my example, but still, some `UTF-16` characters can't fit in 2 bytes. They're 3 or 4 bytes. That's how Unicode works. Java is still able to store them under a single `char` primitive that's supposed to be 2 bytes long. – maklas Aug 21 '19 at 10:12
  • @maklas, try using `UTF_16BE` instead of `UTF_16`. What you got in the first two bytes is a BOM (byte order mark). – RealSkeptic Aug 21 '19 at 10:29

1 Answer


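The character you are using, 的 (U+7684), is a 2-byte character: it fits in a single 16-bit char.

The first two bytes in the encoded byte array are the UTF-16 byte order mark (BOM), 0xFE 0xFF, which prints as -2, -1 because Java bytes are signed. The remaining two bytes, 118 and -124 (0x76 0x84), are the character itself.

An actual 4-byte Unicode code point (a supplementary character above U+FFFF) would be represented as two chars, known as a surrogate pair.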
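To see the byte order mark in isolation, here is a minimal sketch that encodes the same character with both `UTF_16` and `UTF_16BE`. The class name `BomDemo` and the helper `encode` are hypothetical names chosen for illustration; the character is written as the escape `'\u7684'` so the result does not depend on the source file encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class BomDemo {
    // Hypothetical helper: encodes a single char with the given charset.
    static byte[] encode(char c, Charset charset) {
        return String.valueOf(c).getBytes(charset);
    }

    public static void main(String[] args) {
        char c = '\u7684'; // the character 的

        // UTF_16 prepends a big-endian BOM (0xFE 0xFF) when encoding.
        System.out.println(Arrays.toString(encode(c, StandardCharsets.UTF_16)));   // [-2, -1, 118, -124]

        // UTF_16BE fixes the byte order up front, so no BOM is written.
        System.out.println(Arrays.toString(encode(c, StandardCharsets.UTF_16BE))); // [118, -124]

        // The char itself is a single 16-bit value.
        System.out.println(Integer.toHexString(c)); // 7684
    }
}
```

The two extra bytes belong to the encoded byte stream, not to the `char`; they appear once per `getBytes` call, not once per character.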

// U+2070E lies above U+FFFF, so it does not fit in one 16-bit char.
final char[] chars = Character.toChars(0x2070E);
System.out.println(chars.length); // prints 2
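The difference between chars and code points can also be observed on a `String`. The following is a rough sketch, with `SurrogateDemo` and `charCount` as hypothetical names, that builds a string from the same supplementary code point and compares its length in chars, in code points, and in encoded bytes.

```java
import java.nio.charset.StandardCharsets;

public class SurrogateDemo {
    // Hypothetical helper: number of 16-bit chars needed for one code point.
    static int charCount(int codePoint) {
        return Character.toChars(codePoint).length;
    }

    public static void main(String[] args) {
        // One supplementary code point, stored as a surrogate pair.
        String s = new String(Character.toChars(0x2070E));

        System.out.println(s.length());                      // 2 (chars)
        System.out.println(s.codePointCount(0, s.length())); // 1 (code point)

        // The two chars are a high surrogate followed by a low surrogate.
        System.out.println(Character.isHighSurrogate(s.charAt(0))); // true
        System.out.println(Character.isLowSurrogate(s.charAt(1)));  // true

        // Without a BOM, the pair encodes to exactly 4 bytes.
        System.out.println(s.getBytes(StandardCharsets.UTF_16BE).length); // 4

        // A character inside the Basic Multilingual Plane needs one char.
        System.out.println(charCount(0x7684)); // 1
    }
}
```

So every `char` has a fixed size of 16 bits; what varies is how many chars a given code point occupies, and the surrogate ranges let a reader tell the two cases apart without any per-character length prefix.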
Torben