45

Is there any reason why the Java char primitive data type is 2 bytes, unlike C where char is 1 byte?

Thanks

realnumber
  • The short answer is because they goofed: they should have used 32-bit characters. – tchrist Apr 08 '11 at 12:25
  • No, they should not have used 32-bit wide characters. That would make overhead even worse! – vy32 Jul 04 '11 at 04:13
  • @vy32: Yeah. They should really have used [6-bit-wide characters](https://en.wikipedia.org/wiki/Six-bit_character_code). That would save space, and after all, capital letters should be enough for everybody. – Mechanical snail Jul 15 '12 at 03:41
  • 5 bits per character are enough if you want to be space-efficient. In fact, the remaining 4 permutations can also be used - saving even more space. – specializt Sep 29 '14 at 22:03

8 Answers

62

When Java was originally designed, it was anticipated that any Unicode character would fit in 2 bytes (16 bits), so char and Character were designed accordingly. In fact, a Unicode character can now require up to 4 bytes. Thus, UTF-16, Java's internal encoding, requires that supplementary characters use two code units, while characters in the Basic Multilingual Plane (the most common ones) still use one. A Java char holds one code unit. This Sun article explains it well.
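For instance, a minimal sketch (the class name is mine) showing that a supplementary character occupies two char code units while a BMP character occupies one:

    public class CodeUnits {
        public static void main(String[] args) {
            String bmp = "A";               // U+0041, a Basic Multilingual Plane character
            String smiley = "\uD83D\uDE00"; // U+1F600, a supplementary character (surrogate pair)

            // String.length() counts UTF-16 code units (Java chars), not code points
            System.out.println(bmp.length());                              // 1
            System.out.println(smiley.length());                           // 2
            System.out.println(smiley.codePointCount(0, smiley.length())); // 1 code point
            System.out.println(Character.charCount(0x1F600));              // 2 chars needed
        }
    }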

Matthew Flaschen
  • I'm sure Joel will appreciate the plug for "what every programmer should know about character encoding": http://joelonsoftware.com/articles/Unicode.html – fooMonster Nov 10 '11 at 14:56
  • I do agree with your answer, but I have one question: in the String class, all data is stored in a byte array that takes one byte per character (after applying some compression mechanism). Why was the same approach not followed for char? Java 8 and later versions came with drastic changes, so why have they not modified this as well? – Vinay Sharma Sep 21 '21 at 15:14
  • @VinaySharma String was changed to use `byte[]` instead of `char[]` in Java 9 ([JEP 254](https://openjdk.java.net/jeps/254)), with an extra "coder" field to switch a given instance between interpreting the bytes individually (Latin-1) or in pairs (UTF-16) based on whether the instance contains only Latin-1 characters or not, respectively. The size of the `char` type in the JVM has not changed; this compact form only really applies inside `java.lang.String` which does tend to make up _a lot_ of the heap space in many applications. – William Price Feb 02 '22 at 19:45
24

char in Java is UTF-16 encoded, which requires a minimum of 16 bits of storage for each character.
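A quick way to confirm this from code (a sketch; the class name is mine, and Character.BYTES requires Java 8 or later):

    public class CharWidth {
        public static void main(String[] args) {
            System.out.println(Character.SIZE);  // 16 – bits per char
            System.out.println(Character.BYTES); // 2  – bytes per char (constant added in Java 8)
        }
    }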

Vijay Mathew
12

In Java, a character is encoded in UTF-16, which uses 2 bytes, while a normal C string is more or less just a bunch of bytes. When C was designed, using ASCII (which only covers the English-language character set) was deemed sufficient, while the Java designers already accounted for internationalization. If you want to use Unicode with C strings, the UTF-8 encoding is the preferred way, as it has ASCII as a subset and does not use the 0 byte (unlike UTF-16), which is used as an end-of-string marker in C. Such an end-of-string marker is not necessary in Java, as a string is a complex type there, with an explicit length.
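To illustrate the difference, here is a sketch (the class name is mine) showing how many bytes the same text takes under a few of the encodings mentioned above:

    import java.nio.charset.StandardCharsets;

    public class EncodingSizes {
        public static void main(String[] args) {
            String ascii = "hello";
            String accented = "h\u00E9llo"; // "héllo": 'é' is U+00E9

            System.out.println(ascii.getBytes(StandardCharsets.US_ASCII).length); // 5
            System.out.println(ascii.getBytes(StandardCharsets.UTF_8).length);    // 5 – ASCII is a subset of UTF-8
            System.out.println(ascii.getBytes(StandardCharsets.UTF_16BE).length); // 10 – 2 bytes per BMP character
            System.out.println(accented.getBytes(StandardCharsets.UTF_8).length); // 6 – 'é' takes 2 bytes in UTF-8
        }
    }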

DarkDust
8

Earlier languages like C used ASCII notation, whose range is 0–127, covering the unique symbols and characters of the English language.

Java, on the other hand, comes with internationalization: all human-readable characters (including regional symbols) are included as well, so the range is increased and more memory is required. The system that unifies all these symbols is the Unicode standard, and this unification is what requires the additional byte in Java.

The first 128 code points remain as they are, matching the ASCII characters of C and C++, and the unified characters are then appended after them.

So char is 16 bits in Java and 8 bits in C.
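A small sketch of that ASCII-compatibility point (the class name is mine):

    public class AsciiSubset {
        public static void main(String[] args) {
            System.out.println((char) 65);        // A – ASCII code 65 maps to the same character
            System.out.println('A' == '\u0041');  // true – Unicode's first 128 code points are ASCII
            System.out.println((int) '\u0915');   // 2325 – Devanagari letter KA, well beyond 8 bits
        }
    }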

tilak
2

Java™ Tutorials:

The char data type is a single 16-bit Unicode character. It has a minimum value of '\u0000' (or 0) and a maximum value of '\uffff' (or 65,535 inclusive).
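Those bounds can be checked directly against the constants on Character (a minimal sketch; the class name is mine):

    public class CharBounds {
        public static void main(String[] args) {
            System.out.println((int) Character.MIN_VALUE); // 0
            System.out.println((int) Character.MAX_VALUE); // 65535
        }
    }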

Pang
Zeyu
0

Java uses the Unicode ("universal code") representation, which accepts all the language formats in the world, for example:

     ASCII – American Standard Code for Information Interchange

     ISO 8859-1 – for Western European countries

     KOI-8 – for Russian

     GB18030 & Big5 – for Chinese

In Unicode, the first 128 code points are reserved for ASCII, and the remaining code points accept every other language => 2 bytes for char,

while C/C++ use only the ASCII representation => 1 byte for char.
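As a sketch of the contrast (the class name is mine), a legacy single-byte encoding such as ISO 8859-1 stores 'é' in one byte, but once decoded into a Java String each character still occupies a 16-bit char:

    import java.nio.charset.StandardCharsets;

    public class LegacyCharset {
        public static void main(String[] args) {
            byte[] latin1 = { (byte) 0xE9 }; // 'é' in ISO 8859-1: a single byte
            String s = new String(latin1, StandardCharsets.ISO_8859_1);

            System.out.println(s);                                            // é
            System.out.println(s.length());                                   // 1 – one 16-bit char
            System.out.println(s.getBytes(StandardCharsets.UTF_16BE).length); // 2 – two bytes once encoded as UTF-16
        }
    }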

rohit.khurmi095
-2

Java is internationalized, so it works with different languages and needs more than one byte of space; that's why char takes 2 bytes. For example, the Chinese language can't be handled with one byte per character.
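For example (a sketch; the class name is mine), a single Chinese character fits in one 16-bit Java char but cannot fit in one byte:

    import java.nio.charset.StandardCharsets;

    public class ChineseChar {
        public static void main(String[] args) {
            String zh = "\u4E2D"; // the character 中

            System.out.println(zh.length());                                // 1 – one 16-bit char in Java
            System.out.println(zh.getBytes(StandardCharsets.UTF_8).length); // 3 – three bytes in UTF-8
        }
    }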

-2

As we know, C supports ASCII, whereas Java supports Unicode, which contains three things: ASCII, extended ASCII, and local-language characters. ASCII is a subset of Unicode. ASCII supports only the English language, whereas Unicode supports multinational languages. A Java character is encoded in UTF-16, which uses 2 bytes. For all of these reasons, and because Unicode is an extended version of ASCII, Java uses 16 bits instead of 8.