70

I used RandomAccessFile to read a byte from a text file.

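import java.io.IOException;
import java.io.RandomAccessFile;

public static void readFile(RandomAccessFile fr) throws IOException {
    byte[] cbuff = new byte[1];
    // Read exactly one byte from the current file position.
    fr.read(cbuff, 0, 1);
    System.out.println(new String(cbuff));
}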

Why am I seeing one full character being read by this?

Drew Noakes
Shrinath

8 Answers

148

A char represents a character in Java (*). It is 2 bytes large (or 16 bits).

That doesn't necessarily mean that every representation of a character is 2 bytes long. In fact many character encodings only reserve 1 byte for every character (or use 1 byte for the most common characters).

When you call the String(byte[]) constructor you ask Java to convert the byte[] to a String using the platform's default charset(**). Since the platform default charset is usually a 1-byte encoding such as ISO-8859-1 or a variable-length encoding such as UTF-8 (in which every ASCII character occupies exactly 1 byte), it can easily convert that 1 byte to a single character.

If you run that code on a platform that uses UTF-16 (or UTF-32 or UCS-2 or UCS-4 or ...) as the platform default encoding, then you will not get a valid result (you'll get a String containing the Unicode Replacement Character instead).
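To make this concrete, here is a minimal sketch that decodes the same single byte with explicitly named charsets instead of the platform default. The class name SingleByteDecode, the helper decode and the sample byte 0x41 are assumptions chosen for illustration; they are not from the question.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class SingleByteDecode {
    // Decodes exactly one byte using the given charset (hypothetical helper).
    public static String decode(byte b, Charset cs) {
        return new String(new byte[] { b }, cs);
    }

    public static void main(String[] args) {
        byte b = 0x41; // the byte for 'A' in ASCII-compatible encodings
        System.out.println("default charset: " + Charset.defaultCharset());
        System.out.println(decode(b, StandardCharsets.ISO_8859_1)); // A
        System.out.println(decode(b, StandardCharsets.UTF_8));      // A

        // One byte is not a complete UTF-16 code unit, so the decoder
        // substitutes U+FFFD (the Unicode Replacement Character).
        String utf16 = decode(b, StandardCharsets.UTF_16);
        System.out.println(utf16.indexOf('\uFFFD') >= 0);
    }
}
```

The point is only that the result depends on the charset, which is why the same code can behave differently from one platform to another.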

That's one of the reasons why you should not depend on the platform default encoding: when converting between byte[] and char[]/String or between InputStream and Reader or between OutputStream and Writer, you should always specify which encoding you want to use. If you don't, then your code will be platform-dependent.
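A small sketch of what "always specify the encoding" can look like when going from an InputStream to a Reader. The class name ExplicitCharsetRead and the helper firstChar are hypothetical, and UTF-8 is assumed to be the encoding of the data; substitute whatever encoding your file actually uses.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class ExplicitCharsetRead {
    // Reads the first char from the bytes, decoding them as UTF-8
    // regardless of the platform default charset (hypothetical helper).
    public static int firstChar(byte[] data) {
        try (Reader r = new InputStreamReader(
                new ByteArrayInputStream(data), StandardCharsets.UTF_8)) {
            return r.read(); // one UTF-16 code unit, or -1 at end of input
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // U+00E9 (e with acute accent) takes two bytes in UTF-8: 0xC3 0xA9.
        byte[] data = "\u00e9".getBytes(StandardCharsets.UTF_8);
        System.out.println(data.length);               // 2
        System.out.println(firstChar(data) == 0x00E9); // true
    }
}
```

The Reader consumes as many bytes as the encoding needs for one character, which a raw one-byte read cannot do.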

(*) that's not entirely true: a char represents a UTF-16 code unit. Either one or two UTF-16 code units represent a Unicode code point. A Unicode code point usually represents a character, but sometimes multiple Unicode code points are used to make up a single character. But the approximation above is close enough to discuss the topic at hand.

(**) Note that on Android the default character set is always UTF-8, and starting with Java 18 the Java platform itself also switched to this default (but it can still be configured to act the legacy way).
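To illustrate footnote (*), here is a rough sketch of one code point above U+FFFF being stored as two char values. The class name SurrogatePairDemo, the helper toUnits and the sample code point U+1F600 are assumptions picked for the example.

```java
public class SurrogatePairDemo {
    // Returns the UTF-16 code units for a code point (hypothetical helper
    // wrapping the standard Character.toChars).
    public static char[] toUnits(int codePoint) {
        return Character.toChars(codePoint);
    }

    public static void main(String[] args) {
        char[] units = toUnits(0x1F600); // a code point above U+FFFF
        System.out.println(units.length);                        // 2
        System.out.println(Character.isHighSurrogate(units[0])); // true
        System.out.println(Character.isLowSurrogate(units[1]));  // true

        String s = new String(units);
        System.out.println(s.length());                      // 2 (chars)
        System.out.println(s.codePointCount(0, s.length())); // 1 (code point)
    }
}
```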

Joachim Sauer
  • 1
    Surely characters in Java are 1 - 4 bytes, because of Unicode support? – Michael Feb 22 '11 at 12:57
  • 40
    @Mikaveli: no. A `char` in Java is **always** 2 bytes long. As you probably know, there are Unicode code points > 2^16. To represent those in a `String` Java uses 2 `char` values (a low-surrogate and a high-surrogate). This means that a `String` is effectively UTF-16 encoded. But that fact is outside the scope of this question. – Joachim Sauer Feb 22 '11 at 13:04
  • @Joachim: Yes, you're quite right - the code points fit in (hex) 0000 to FFFF, so natively that's 2 bytes. – Michael Feb 22 '11 at 13:11
  • 4
    @Mikaveli: this discussion is *way* out of scope of the question, but not quite: Unicode **codepoints** go from `U+0000` to `U+10FFFF` (where not all of them are used and some are declared to *never* be used). A `char` in Java can take the values `U+0000` to `U+FFFF`. To represent Unicode codepoints > `U+FFFF` you'll need to use two adjacent `char` values (one in the Low Surrogate range (U+DC00..U+DFFF) and one in the High Surrogate range (U+D800..U+DBFF)). – Joachim Sauer Feb 22 '11 at 13:16
  • @Joachim: Unicode supports more code points, but according to http://download.oracle.com/javase/1.4.2/docs/api/java/lang/Character.html#MAX_VALUE Java's Character class doesn't. – Michael Feb 22 '11 at 13:20
  • 1
    @Mikaveli: that's what I'm saying! `char` and `Character` only support `U+0000` to `U+FFFF`. Java (via `String`) supports the full range (by combining two `char` values to form a Unicode codepoint)! – Joachim Sauer Feb 22 '11 at 13:23
  • @Joachim: Doesn't that bring us back to "4 byte characters" then, lol? – Michael Feb 22 '11 at 13:27
  • 2
    @Mikaveli: yes, but in a way that's unrelated to the question: The question isn't actually about the internal representation of text in Java (opposite to what the title suggests), but about converting a single byte to a valid character, which can easily be explained without going into detail of the storage of textual data in Java (and explaining all that in the answer would just serve to confuse the issue even more). – Joachim Sauer Feb 22 '11 at 13:35
16

Java stores all its "chars" internally as two bytes. However, when they are encoded into bytes (for example, when written to a file), the number of bytes per character depends on the encoding.

In UTF-8, for example, ASCII characters take a single byte, but many other characters are multi-byte.

Java supports Unicode, and according to the

Java Character Docs

the maximum value of a char is "\uFFFF" (hex FFFF, decimal 65535), or 11111111 11111111 in binary, which fits in two bytes.
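These limits can be checked directly against the constants on java.lang.Character. This is a minimal sketch; the class name CharLimits and the helper maxCharValue are made up for the example, and Character.BYTES assumes Java 8 or later.

```java
public class CharLimits {
    // Largest value a char can hold, widened to int (hypothetical helper).
    public static int maxCharValue() {
        return Character.MAX_VALUE;
    }

    public static void main(String[] args) {
        System.out.println(Character.BYTES);   // 2
        System.out.println(maxCharValue());    // 65535
        System.out.println(maxCharValue() == 0xFFFF); // true
        // Unicode itself goes further than a single char can represent.
        System.out.println(Integer.toHexString(Character.MAX_CODE_POINT)); // 10ffff
    }
}
```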

Dimitry K
Michael
  • 2
    How does \uFFFF prove that characters can be 1-4 bytes? 0xFFFF are 2 bytes. Also: U+FFFF is **not** the highest Unicode codepoint, there are much larger ones. – Joachim Sauer Feb 22 '11 at 13:10
  • I've edited my answer - I was thinking of UTF-8, but as you state 0xFFFF fits into 2 bytes. – Michael Feb 22 '11 at 13:17
  • And `0xFFFF` is equal to decimal `65535` (five in the end) not 65536 (which in turn is equal to `0x10000` :) – Dimitry K Jun 02 '14 at 22:41
7

The constructor String(byte[] bytes) takes the bytes from the buffer and decodes them to characters.

It uses the platform default charset to decode the bytes. If you know that your file contains text encoded in a different charset, you can use String(byte[] bytes, String charsetName) to decode with the correct charset.
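As a rough illustration of how the byte count varies by encoding while the decoded String stays the same, consider the sketch below. The class name EncodedLengths and the helper byteCount are hypothetical; it uses the String(byte[], Charset) overload rather than the charsetName one only to avoid the checked UnsupportedEncodingException.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodedLengths {
    // Number of bytes the text occupies in the given charset (hypothetical helper).
    public static int byteCount(String s, Charset cs) {
        return s.getBytes(cs).length;
    }

    public static void main(String[] args) {
        String e = "\u00e9";    // e with acute accent
        String euro = "\u20ac"; // euro sign
        System.out.println(byteCount(e, StandardCharsets.ISO_8859_1)); // 1
        System.out.println(byteCount(e, StandardCharsets.UTF_8));      // 2
        System.out.println(byteCount(e, StandardCharsets.UTF_16BE));   // 2
        System.out.println(byteCount(euro, StandardCharsets.UTF_8));   // 3

        byte[] raw = e.getBytes(StandardCharsets.UTF_8);
        // Decoding with the matching charset restores the original text.
        System.out.println(new String(raw, StandardCharsets.UTF_8).equals(e)); // true
        // Decoding with the wrong charset yields two unrelated characters.
        System.out.println(new String(raw, StandardCharsets.ISO_8859_1).length()); // 2
    }
}
```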

Andreas Dolk
  • 1
    This answer can be made even better by giving examples on both single-byte encodings and multi-byte encodings and how the size varies for String. `char` on other hand will always be 2 bytes. I don't have enough experience with multiple encodings, else would do myself. – saurabheights Mar 18 '17 at 20:22
2

In an ASCII text file, each character is just one byte.

RemoteSojourner
  • 3
    ASCII itself is pretty much irrelevant these days. ASCII-based encodings, however are still around. But almost no one uses ASCII as it is. – Joachim Sauer Feb 22 '11 at 13:08
1

There are some great answers here, but I wanted to point out that the JVM is free to store a char value in any size space >= 2 bytes.

On many architectures there is a penalty for performing unaligned memory access, so a char might easily be padded to 4 bytes. A volatile char might even be padded to the size of the CPU cache line to prevent false sharing (see https://en.wikipedia.org/wiki/False_sharing).

It might be non-intuitive to new Java programmers that a character array or a string is NOT simply multiple characters. You should learn and think about strings and arrays distinctly from "multiple characters".

I also want to point out that Java chars are often misused. People don't realize they are writing code that won't properly handle code points that need more than 16 bits.
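A short sketch of the kind of mistake meant here, counting and iterating by char versus by code point. The class name CodePointCount, the helper count and the sample string are assumptions for illustration; String.codePoints() assumes Java 8 or later.

```java
public class CodePointCount {
    // Counts Unicode code points rather than UTF-16 code units (hypothetical helper).
    public static int count(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // "a", then U+1F600 (needs two chars), then "b".
        String s = "a" + new String(Character.toChars(0x1F600)) + "b";
        System.out.println(s.length()); // 4 chars
        System.out.println(count(s));   // 3 code points

        // Iterating with charAt would split the surrogate pair;
        // codePoints() keeps each code point whole.
        s.codePoints().forEach(cp -> System.out.println(Integer.toHexString(cp)));
        // prints 61, 1f600, 62 (one per line)
    }
}
```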

1

It looks like your file contains ASCII characters, which are encoded in just 1 byte each. If the text file contained a non-ASCII character, e.g. one encoded as 2 bytes in UTF-8, then you would get just its first byte, not the whole character.
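A minimal sketch of that situation, assuming the file is UTF-8: only the first byte of the encoded text is decoded, which is what a one-byte read gives you. The class name PartialUtf8 and the helper decodeFirstByte are hypothetical.

```java
import java.nio.charset.StandardCharsets;

public class PartialUtf8 {
    // Encodes the text as UTF-8, then decodes only its first byte
    // (hypothetical helper mimicking a one-byte read).
    public static String decodeFirstByte(String text) {
        byte[] all = text.getBytes(StandardCharsets.UTF_8);
        return new String(all, 0, 1, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // ASCII: the single byte is the whole character.
        System.out.println(decodeFirstByte("A").equals("A")); // true
        // U+00E9 is 0xC3 0xA9 in UTF-8; 0xC3 alone is incomplete,
        // so the decoder substitutes the replacement character U+FFFD.
        System.out.println(decodeFirstByte("\u00e9").equals("\u00e9")); // false
        System.out.println(decodeFirstByte("\u00e9").indexOf('\uFFFD') >= 0);
    }
}
```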

andrew
  • 1
    ASCII is **not** the only single byte encoding, there are tons of others out there (such as the ISO-8859-* family, the Windows-* family, EBCDIC, KOI-8, ...). – Joachim Sauer Feb 22 '11 at 13:18
0

Java allocates 2 bytes for a char because it follows UTF-16. In UTF-16, a character occupies a minimum of 2 bytes (one char) and a maximum of 4 bytes (two chars, a surrogate pair). There is no 1-byte or 3-byte storage for a character.

Siva
0

The Java char is 2 bytes. But the file encoding may be different.

So first you should know what encoding your file uses. For example, if the file is ASCII encoded, or UTF-8 encoded and contains only ASCII characters, then you will retrieve the right chars by reading one byte at a time.

If the encoding of the file is UTF-16, reading one byte may still show you the correct char if the file is little endian. For example, the little-endian UTF-16 encoding of A is [65, 0]. When you read the first byte, it returns 65. Treating that byte as a char pads the upper byte with 0, so you get A.
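The byte layout described above can be checked with a few lines. This is only a sketch; the class name Utf16FirstByte and the helper firstByteAsChar are assumptions, not anything from the question.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Utf16FirstByte {
    // Encodes the text, takes only the first byte and zero-extends it
    // to a char (hypothetical helper).
    public static char firstByteAsChar(String text, Charset cs) {
        byte[] bytes = text.getBytes(cs);
        return (char) (bytes[0] & 0xFF);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString("A".getBytes(StandardCharsets.UTF_16LE))); // [65, 0]
        System.out.println(Arrays.toString("A".getBytes(StandardCharsets.UTF_16BE))); // [0, 65]

        // Little endian: the first byte happens to be the character's value.
        System.out.println(firstByteAsChar("A", StandardCharsets.UTF_16LE) == 'A'); // true
        // Big endian: the first byte is 0, so the trick does not work.
        System.out.println(firstByteAsChar("A", StandardCharsets.UTF_16BE) == 'A'); // false
    }
}
```

This only works by coincidence for characters below U+0100 in little-endian files, so it is not something to rely on.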

bjmd