62

If I convert a character to a byte and then back to a char, that character mysteriously disappears and becomes something else. How is this possible?

This is the code:

char a = 'È';       // line 1       
byte b = (byte)a;   // line 2       
char c = (char)b;   // line 3
System.out.println((char)c + " " + (int)c);

Until line 2 everything is fine:

  • In line 1 I could print "a" in the console and it would show "È".

  • In line 2 I could print "b" in the console and it would show -56, which is 200 when read as an unsigned value (-56 + 256 = 200), because byte is signed. And 200 is "È". So it's still fine.

But what's wrong in line 3? "c" becomes something else and the program prints ? 65480. That's something completely different.

What should I write in line 3 in order to get the correct result?

tchrist
  • 78,834
  • 30
  • 123
  • 180
user1883212
  • 7,539
  • 11
  • 46
  • 82
  • 14
    A `byte` is `8 bit`. `char` is `16 bit`. Got the idea? – Rohit Jain Jul 28 '13 at 20:41
  • @RohitJain And a character — by which I mean a Unicode code point — can take two chars or four bytes. Furthermore, who knows what normalization form things are in? The string `"È"` can itself comprise one or two code points depending on whether it is in Normalization Form C or D respectively. – tchrist Jul 29 '13 at 02:49
  • 3
    Two bytes for `char` vs one for `byte` is a problem in the general case, but here, on its own, that wouldn't matter as 'È' is a codepoint below 256, so could be stored in one byte. Problem here is that `char` is unsigned while `byte` isn't. Casting `char` to `byte` only works for ASCII, so not for codepoints above 127, like here. – Lumi Feb 06 '14 at 11:22
  • Does this answer your question? [Char into byte? (Java)](https://stackoverflow.com/questions/4958658/char-into-byte-java) – user12208242 Aug 26 '20 at 09:01

3 Answers

85

A char in Java is a UTF-16 code unit, which is treated as an unsigned 16-bit number. So if you perform c = (char)b the value you get is 2^16 - 56, i.e. 65536 - 56 = 65480.

Or more precisely, the byte is first converted to a signed integer with the value 0xFFFFFFC8 using sign extension in a widening conversion. This in turn is then narrowed down to 0xFFC8 when casting to a char, which translates to the positive number 65480.
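A small sketch that makes the intermediate values visible (variable names are mine; paste into a main method to run):

byte b = (byte) 'È';              // 0xC8, i.e. -56 as a signed byte
int widened = b;                  // widening with sign extension: 0xFFFFFFC8 (-56)
char narrowed = (char) widened;   // narrowing to 16 bits: 0xFFC8 (65480)
System.out.printf("0x%08X -> 0x%04X (%d)%n", widened, (int) narrowed, (int) narrowed);
// prints: 0xFFFFFFC8 -> 0xFFC8 (65480)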

From the language specification:

5.1.4. Widening and Narrowing Primitive Conversion

First, the byte is converted to an int via widening primitive conversion (§5.1.2), and then the resulting int is converted to a char by narrowing primitive conversion (§5.1.3).


To get the right result, use char c = (char) (b & 0xFF). The mask zeroes the top 24 bits of the widened value, so the byte value of b is first converted to the positive integer 200: 0xFFFFFFC8 becomes 0x000000C8, i.e. 200 in decimal.
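A small sketch of the masking fix (variable names follow the question; the output assumes a console that can display È):

byte b = (byte) 'È';           // 0xC8, stored as -56 in a signed byte
char c = (char) (b & 0xFF);    // b & 0xFF widens to int 0x000000C8 (200), then narrows to char
System.out.println(c + " " + (int) c);   // prints: È 200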


Above is a direct explanation of what happens during conversion between the byte, int and char primitive types.

If you want to encode/decode characters from bytes, use Charset, CharsetEncoder, CharsetDecoder or one of the convenience methods such as new String(byte[] bytes, Charset charset) or String#getBytes(Charset charset). You can get a standard character set such as UTF-8 from StandardCharsets, or look one up by name (e.g. Windows-1252) with Charset#forName.
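For example, a round trip through an explicit character set could look like this (my sketch; ISO-8859-1 is just one encoding in which 'È' fits in a single byte):

import java.nio.charset.StandardCharsets;

byte[] encoded = "È".getBytes(StandardCharsets.ISO_8859_1);   // [-56], i.e. 0xC8
String decoded = new String(encoded, StandardCharsets.ISO_8859_1);
System.out.println(decoded);   // È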

Maarten Bodewes
  • 90,524
  • 13
  • 150
  • 263
  • 9
    Actually, a Java `char` is not a Unicode *code **point***. It is a UTF-16 *code **unit***. To actually represent an arbitrary Unicode “character” (by which I mean an actual code point), a Java `char` is not good enough: you must use an `int` (effectively giving you UTF-32), which can take up to two chars in legacy UTF-16 notation. That’s why everything has a `codePointAt` API, not just the bad old legacy `charAt` API. – tchrist Jul 29 '13 at 02:37
  • 2
    Why is the `char c = (char) (b & 0xFF)` only using a single byte, when Java chars are supposed to be two bytes? – Cory Jun 06 '14 at 16:05
  • 1
    @Maarten .. Thanks for the nice catch. Do you know the reason as to why the byte is first widened to an integer and then narrowed to a character? Why not directly widen a byte to a character? – Rocky Inde Apr 19 '16 at 10:50
  • 2
    @RockyInde I've looked at this answer again now it is at 50 upvotes. The answer seems correct, but the answer to this comment did not. It is mainly because *everything* is generally converted to integers in Java. `int` really is the main type in Java; calculations of bytes, shorts and chars are all widened to integer types during such a calculation. This conversion is just a basic but weird example of this. – Maarten Bodewes Nov 05 '18 at 16:28
0

This worked for me:

// Add import statement
import java.nio.charset.Charset;

// Change
sun.io.ByteToCharConverter.getDefault().getCharacterEncoding() -> Charset.defaultCharset()
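A minimal usage sketch (my illustration; the printed name depends on the platform default):

import java.nio.charset.Charset;

// Replacement for the internal sun.io lookup of the default encoding
Charset cs = Charset.defaultCharset();
System.out.println(cs.name());   // e.g. UTF-8 or windows-1252, depending on the platform

If the old call was used to obtain the encoding name as a String, Charset.defaultCharset().name() is the closer equivalent.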
-2

new String(byteArray, Charset.defaultCharset())

This will decode a byte array into a String using the default charset in Java. It may throw exceptions depending on what you supply with the byteArray.
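For example (my sketch; what gets printed depends on the platform's default charset):

import java.nio.charset.Charset;

byte[] byteArray = { (byte) 0xC8 };   // 'È' in ISO-8859-1 / Windows-1252
String s = new String(byteArray, Charset.defaultCharset());
System.out.println(s);   // "È" on a Latin-1 style default, "�" on UTF-8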

Joe
  • 1,316
  • 9
  • 17
  • 1
    Wrong. From the documentation: "This method always replaces malformed-input and unmappable-character sequences with this charset's default replacement string. The `CharsetDecoder` class should be used when more control over the decoding process is required." So it doesn't throw exceptions as you suggest. – Maarten Bodewes Apr 13 '20 at 00:28
  • Doesn't mean it is wrong. It means if you need more control, use CharsetDecoder – Joe Apr 13 '20 at 03:28
  • No, it is wrong because you indicate that it may throw exceptions while it doesn't. Yes, you can use `CharsetDecoder` for more control, but that doesn't make the answer correct. Happy to upvote corrected answers. – Maarten Bodewes Oct 26 '20 at 22:50