
I read from a TCP/IP socket s:

byte[] bbuf = new byte[30];
s.getInputStream().read(bbuf);
for (int i = 0; i < bbuf.length; i++)
{
     System.out.println(Integer.toHexString( (int) (bbuf[i] & 0xff)));
}

This outputs CA 68 9F 75, which is what I would expect. Now I want to use chars instead:

char[] cbuf = new char[30];
BufferedReader input = new BufferedReader(new InputStreamReader(s.getInputStream()));
input.read(cbuf);   // actually fill cbuf; without this the array stays all zeros
for (int i = 0; i < cbuf.length; i++)
{
     System.out.println(Integer.toHexString( (int) (cbuf[i] )));
}

Now the output is CA 68 178 75, so the third byte (and only the third byte) comes out differently. I assume it has to do with character sets and that I have to specify a character set in the InputStreamReader, but I have no idea how to find out which character set I should use. Secondly, if it really is a character-set issue, I am surprised that only exactly one character gets garbled; I tried all kinds of other characters, but this seems to be the only one I was able to find.

Who can solve the mystery?

AndyAndroid
  • You need to know how the characters were encoded. I would try `UTF-8` instead of your default encoding to start with. – Peter Lawrey Oct 17 '16 at 11:55

3 Answers


Your problem is that you are comparing apples with oranges; bytes are not the same as characters. In your data, the character Ÿ is represented in the following two ways:

  • 9F (the byte value, because the text was encoded with Windows-1252)
  • 178 (the char value, i.e. the code point U+0178, because Java always uses UTF-16 for chars internally)

As a proof of what I'm saying, check this:

String myString = "Caña";
byte[] bbuf = myString.getBytes();     // [ 43, 61, C3, B1, 61 ]   (UTF-8 on my machine)
char[] cbuf = myString.toCharArray();  // [ 43, 61, F1, 61 ]  (Java uses UTF-16 internally)
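
And to see the 9F → 178 mapping itself, a minimal sketch (it assumes the windows-1252 charset is available in your JRE, which it is on Windows and on most other JDK installs):

import java.nio.charset.Charset;

byte[] raw = { (byte) 0x9F };
String decoded = new String(raw, Charset.forName("windows-1252"));  // decode the single byte 9F
System.out.println(Integer.toHexString(decoded.charAt(0)));         // prints 178 (U+0178, 'Ÿ')
System.out.println(Integer.toHexString(raw[0] & 0xff));             // prints 9f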

Now an analysis of your problem:

  • You got your byte array from a String, I guess by doing something like myString.getBytes(); since no encoding was specified there, it used that machine's default encoding (Windows-1252).

  • When you read your bytes into a String using an InputStreamReader etc., there is actually no problem, because you are reading from another (or the same) Windows machine. The problem is when you get an array of chars (instead of an array of bytes) and expect the same result: use myString.getBytes() instead of myString.toCharArray() and you'll see your bytes correctly (see the sketch right after this list).
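
As a sketch of that round trip, using the four bytes from your own output (and assuming the encoding on both sides really is Windows-1252, which is only a guess):

import java.nio.charset.Charset;

Charset cp1252 = Charset.forName("windows-1252");

byte[] received = { (byte) 0xCA, 0x68, (byte) 0x9F, 0x75 };  // the bytes from the question
String text = new String(received, cp1252);                  // decode with the sender's charset

byte[] back = text.getBytes(cp1252);    // [ CA, 68, 9F, 75 ]  -> the original bytes again
char[] chars = text.toCharArray();      // [ CA, 68, 178, 75 ] -> UTF-16 char values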

Finally, some advice:

  • Always declare the encoding explicitly when you convert between Strings and byte arrays (a StandardCharsets variant is shown right after this list):

    byte[] bbuf = myString.getBytes(Charset.forName("UTF-8"));
    
    String myString = new String(bbuf, Charset.forName("UTF-8"));
    
  • Never mix chars and bytes; they are not the same thing.
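
On Java 7 and later you can also use the constants from java.nio.charset.StandardCharsets instead of the string lookup; they are guaranteed to exist, so a typo can't cause an UnsupportedCharsetException at runtime (the variable names below are just illustrative):

import java.nio.charset.StandardCharsets;

byte[] bbuf = myString.getBytes(StandardCharsets.UTF_8);
String copy = new String(bbuf, StandardCharsets.UTF_8);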

morgano

InputStreamReader is going to convert the bytes from the input stream to characters using a character encoding. Since you didn't specify explicitly what character encoding should be used, it's going to use the default character encoding of your system.

How the bytes are converted depends on what character encoding is being used.

If the data is binary data and does not represent text encoded with some character encoding, then using InputStreamReader is the wrong way to read this data.
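
For example, a minimal sketch of both cases, reusing the socket s from the question (UTF-8 is only a placeholder here; you have to know which encoding the sender actually used):

import java.io.BufferedReader;
import java.io.DataInputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

// Text: tell the InputStreamReader which encoding to decode with
BufferedReader reader = new BufferedReader(
        new InputStreamReader(s.getInputStream(), StandardCharsets.UTF_8));

// Binary data: stay at the byte level
DataInputStream in = new DataInputStream(s.getInputStream());
byte[] bbuf = new byte[30];
in.readFully(bbuf);   // unlike read(), this blocks until all 30 bytes have arrived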

See also: Streams and readers/writers

Jesper

I don't know if there are any side effects here, but I do this:

buf = new String(buffer, StandardCharsets.ISO_8859_1).toCharArray();

Where "buffer" is a byte array I get from reading from a GZIPInputStream. This is just an expansion on Morgano's explanation above.

Keith Fosberg