
Why does a character in Java take twice as much space to store as a character in C?

ion3023

5 Answers


In Java characters are 16-bit, and in C they are 8-bit.

A more general question is: why is this so?

To find out why, you need to look at the history and come to your own conclusions/opinions on the subject.

When C was developed in the USA, ASCII was pretty much the standard there, and you only really needed 7 bits, but with 8 you could handle some non-ASCII characters as well. That might have seemed more than enough. Many text-based protocols like SMTP (email), XML and FIX still only use ASCII characters; email and XML encode non-ASCII characters. Binary files, sockets and streams are still natively 8-bit bytes.

BTW: C can support wider characters, but that is not plain char.

When Java was developed, 16 bits seemed like enough to support most languages. Since then Unicode has been extended to code points above 65535, and Java has had to add support for code points, which are encoded in UTF-16 as one or two 16-bit char values.
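
To see what that means in practice, here is a small sketch (U+1F600 is just an arbitrary example; any code point above 65535 behaves the same): a supplementary code point occupies two char values, a surrogate pair, so length() counts chars rather than code points.

    public class CodePoints {
        public static void main(String[] args) {
            String basic = "A";            // U+0041 fits in a single 16-bit char
            String emoji = "\uD83D\uDE00"; // U+1F600 is above 65535: a surrogate pair

            System.out.println(basic.length());                          // 1 char
            System.out.println(emoji.length());                          // 2 chars
            System.out.println(emoji.codePointCount(0, emoji.length())); // 1 code point
            System.out.println(Character.charCount(0x1F600));            // 2 chars needed
        }
    }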

So making byte an 8-bit byte and char an unsigned 16-bit value made sense at the time.

BTW: If your JVM supports -XX:+UseCompressedStrings, it can use bytes instead of chars for Strings that only contain 8-bit characters.

Peter Lawrey

Because Java uses Unicode, while C generally uses ASCII by default.

There are various flavours of Unicode encoding, but Java uses UTF-16, which uses either one or two 16-bit code units per character. ASCII always uses one byte per character.
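
A quick sketch to make the size difference concrete (UTF-16BE is used here only to avoid the byte-order mark that plain UTF-16 prepends):

    import java.nio.charset.StandardCharsets;

    public class EncodedSizes {
        public static void main(String[] args) {
            String text = "hello"; // 5 ASCII characters
            System.out.println(text.getBytes(StandardCharsets.US_ASCII).length); // 5 bytes
            System.out.println(text.getBytes(StandardCharsets.UTF_16BE).length); // 10 bytes
        }
    }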

DNA

Java is a modern language that came up around the early Unicode era (at the beginning of the 90s), so it supports Unicode as a first-class citizen by default, like many contemporary languages (Python, Visual Basic, JavaScript...), OSes (Windows, Symbian, BREW...) and frameworks/interfaces/specifications (Qt, NTFS, Joliet...). At the time those were designed, Unicode was a fixed 16-bit charset encoded in UCS-2, so it made sense for them to use 16-bit values for characters.

In contrast, C is an "ancient" language that was invented decades before Java, when Unicode was far from being a thing. That was the age of 7-bit ASCII and 8-bit EBCDIC, so C uses an 8-bit char1, which was enough for a char variable to contain all the basic characters. When the Unicode era arrived, to refrain from breaking old code they introduced a different character type, wchar_t, in C90. Again, this was the 90s, when Unicode was just beginning its life. In any case, char had to keep its old size because you still need to access individual bytes even if you use wider characters (Java has a separate byte type for this purpose).


Of course, the Unicode Consortium quickly realized that 16 bits are not enough and had to fix it somehow. They widened the code-point range by changing UCS-2 into UTF-16, so as not to break old code that used wide characters, making Unicode a 21-bit charset (actually only up to U+10FFFF rather than U+1FFFFF, because of how UTF-16 works). Unfortunately it was too late, and the old implementations that had chosen a 16-bit char were stuck with it.

Later we saw the advent of UTF-8, which proved to be far superior to UTF-16 because it's independent of endianness, generally takes up less space, and, most importantly, requires no changes to the standard C string functions. Most user functions that receive a char* continue to work without special Unicode support.
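
The space difference is easy to check; a rough sketch (in Java, to match the question, though the same holds in any language) for mostly-ASCII text:

    import java.nio.charset.StandardCharsets;

    public class Utf8VsUtf16 {
        public static void main(String[] args) {
            // 20 characters, only one of them non-ASCII (the e with acute accent)
            String mostlyAscii = "print(\"h\u00E9llo world\")";
            System.out.println(mostlyAscii.getBytes(StandardCharsets.UTF_8).length);    // 21 bytes
            System.out.println(mostlyAscii.getBytes(StandardCharsets.UTF_16BE).length); // 40 bytes
        }
    }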

Unix systems were lucky because they migrated to Unicode later, after UTF-8 had been introduced, and therefore continue to use 8-bit char. OTOH all modern Win32 APIs work on 16-bit wchar_t by default, because Windows was also an early adopter of Unicode. As a result, the .NET framework and C# go the same way, with char as a 16-bit type.


Speaking of wchar_t, it was so unportable that both the C and C++ standards needed to introduce new character types, char16_t and char32_t, in their 2011 revisions:

Both C and C++ introduced fixed-size character types char16_t and char32_t in the 2011 revisions of their respective standards to provide unambiguous representation of 16-bit and 32-bit Unicode transformation formats, leaving wchar_t implementation-defined

https://en.wikipedia.org/wiki/Wide_character#Programming_specifics

That said, most implementations are working on improving the wide-string situation. Java experimented with compressed strings in Java 6 and introduced compact strings in Java 9. Python moved to a more flexible internal representation than the wchar_t* it used before Python 3.3. Firefox and Chrome have separate internal 8-bit character representations for simple strings. There are also discussions about that for the .NET framework. And more recently, Windows has been gradually introducing UTF-8 support for the old "ANSI" APIs.


1 Strictly speaking, char in C is only required to have at least 8 bits. See What platforms have something other than 8-bit char?

phuclv

The Java 2 platform uses the UTF-16 representation in char arrays and in the String and StringBuffer classes.
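
One way to see that representation is to dump the char values of a String holding a character outside the 16-bit range (U+1F600 is just an arbitrary example): the char array contains the UTF-16 surrogate pair, not the code point itself.

    public class Utf16Dump {
        public static void main(String[] args) {
            String s = "A\uD83D\uDE00"; // 'A' followed by U+1F600
            for (char c : s.toCharArray()) {
                System.out.printf("U+%04X%n", (int) c); // prints U+0041, U+D83D, U+DE00
            }
        }
    }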

Dmytro Chyzhykov

A Java char holds a UTF-16 code unit of Unicode text, while C uses ASCII encoding in most cases.

Pico
  • I thought I would simplify the phrasing but you're right. 'A unicode char' is just wrong. Edited the reply. – Pico Feb 21 '12 at 00:03