Decode error in BufferedReader

Question

I receive some data from a server and read it with this Java code:

is = new BufferedInputStream(connection.getInputStream());
reader = new BufferedReader(new InputStreamReader(is, "UTF-8"));

int length;
char[] buffer = new char[4096];
StringBuilder sb = new StringBuilder();
while ((length = reader.read(buffer)) != -1) {
   sb.append(new String(buffer, 0, length));//buffer is already incorrect
}

byte[] byteDatas = sb.toString().getBytes();

Then I print byteDatas as a hex string:

[image: hex dump of byteDatas]

Compare that with Wireshark's result:

[image: Wireshark capture of the same response]

Some bytes come out as ef bf bd. I know that is the UTF-8 encoding of \ufffd (65533), which stands for invalid data.

So I think there must be a decoding error in my code. After debugging, I found that if I read the data directly from connection.getInputStream(), there is no invalid data.

So the problem must be in BufferedReader or InputStreamReader. But I have already specified "UTF-8", and the data in Wireshark does not look very weird. Is UTF-8 the wrong charset? The server does not send a charset in its reply.

Please help me get BufferedReader to read the correct data.

UPDATE

My default charset is "UTF-8", and I have confirmed this in the debugger. The data is already wrong when read returns, so it's not String's fault.

You need to specify the charset when you read text input... that is the problem one way or the other. – Jared, Jan 07 '15 at 10:03
I'm thinking that you're *not* reading character streams but byte content, hence the issue with the data. – Buhake Sindi, Jan 07 '15 at 10:05
@BuhakeSindi It is byte content, but I thought this approach could read byte content too. – zzy, Jan 07 '15 at 10:21
@zzy Not true. Read this [SO answer](http://stackoverflow.com/questions/5764065/the-difference-between-inputstream-and-inputstreamreader-when-reading-multi-byte#answer-5764357) to best understand what I mean. Rather, read the `byte`s directly and hex-encode them instead of using character arrays. – Buhake Sindi, Jan 07 '15 at 10:28
@BuhakeSindi I got it. Now, I don't know in advance whether the server will reply with `character` data or `byte` data. Should I always read bytes rather than chars so the data is read correctly, and let the upper layer decide whether to convert it to `character` data? – zzy, Jan 07 '15 at 11:11
Check the `Content-Type` header. This will tell you whether the content is a character stream or a byte stream and what `charset` the data is encoded in. – Buhake Sindi, Jan 07 '15 at 12:01
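
The advice in the comments above (read raw bytes, hex-encode them, and decode only when `Content-Type` declares a charset) could be sketched roughly as follows. The class name `RawBodyReader` and its helper methods are hypothetical names chosen for illustration, and the `Content-Type` parsing is deliberately simplistic; this is a sketch, not a complete header parser.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.util.Locale;

// Hypothetical helper class; the names are illustrative only.
class RawBodyReader {

    /** Copies the stream into a byte array without any character decoding. */
    static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        int length;
        while ((length = in.read(buffer)) != -1) {
            out.write(buffer, 0, length);
        }
        return out.toByteArray();
    }

    /**
     * Returns the charset named in a Content-Type value such as
     * "text/html; charset=ISO-8859-1", or null if none is declared.
     * Charset.forName throws if the name is illegal or unsupported.
     */
    static Charset charsetOf(String contentType) {
        if (contentType == null) {
            return null;
        }
        for (String part : contentType.split(";")) {
            String param = part.trim();
            if (param.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = param.substring("charset=".length()).replace("\"", "").trim();
                return Charset.forName(name);
            }
        }
        return null;
    }

    /** Formats bytes as space-separated lowercase hex, e.g. "ef bf bd". */
    static String toHex(byte[] data) {
        StringBuilder sb = new StringBuilder();
        for (byte b : data) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }
}
```

With a `URLConnection`, the header value would come from `connection.getContentType()`. If `charsetOf` returns null, the caller can keep the `byte[]` as it is and leave any decoding to the upper layer.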

icza · Answer 1 · 2015-01-07T10:18:37.880

0

String.getBytes() will use the platform's default encoding (not necessarily UTF-8) to convert the characters of the String to bytes.

Quoting from the javadoc of String.getBytes():

Encodes this String into a sequence of bytes using the platform's default charset...

You can't compare the UTF-8 encoded input data to a result that might not be UTF-8 encoded. Instead, specify the encoding explicitly, like this:

byte[] byteDatas = sb.toString().getBytes(StandardCharsets.UTF_8);

Note:

If your input data is NOT UTF-8 encoded text and you attempt to decode it as UTF-8, the decoder may replace invalid byte sequences with the replacement character \ufffd. As a result, the bytes you get by encoding the String will not be the same as the raw input bytes.
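
To illustrate the note, here is a minimal sketch of what happens when bytes that are not valid UTF-8 are decoded and then encoded again. The class name `LossyDecodeDemo` and the three sample bytes are made up for the demonstration; they are not taken from the question's capture.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the input bytes are invented for illustration.
class LossyDecodeDemo {

    /** Formats bytes as space-separated lowercase hex. */
    static String toHex(byte[] data) {
        StringBuilder sb = new StringBuilder();
        for (byte b : data) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    /** Decodes as UTF-8, then encodes as UTF-8 again. */
    static byte[] roundTrip(byte[] raw) {
        // Malformed input is silently replaced with U+FFFD during decoding.
        String decoded = new String(raw, StandardCharsets.UTF_8);
        return decoded.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // 0x89 on its own is not a valid UTF-8 sequence.
        byte[] raw = {0x41, (byte) 0x89, 0x42};
        System.out.println(toHex(raw));            // prints "41 89 42"
        System.out.println(toHex(roundTrip(raw))); // prints "41 ef bf bd 42"
    }
}
```

The stray byte comes back as `ef bf bd`, the UTF-8 encoding of \ufffd, so the original value cannot be recovered from the String. Input that is valid UTF-8 survives the same round trip unchanged.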

edited Jan 07 '15 at 10:18

answered Jan 07 '15 at 10:12


The buffer is already wrong when I get it from `read`. It's not String's mistake. – zzy Jan 07 '15 at 10:20
Then it may very well be that your input is NOT UTF-8 encoded text, and if you attempt to decode it as UTF-8, the decoder may replace invalid byte sequences. – icza Jan 07 '15 at 10:21
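
One possible way to test that suspicion is to decode the raw bytes with a decoder configured to report errors instead of replacing them. The sketch below assumes the bytes have already been read from the stream without decoding; `StrictUtf8` and `isValidUtf8` are hypothetical names.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; the names are illustrative only.
class StrictUtf8 {

    /** Returns true only if the whole array is well-formed UTF-8. */
    static boolean isValidUtf8(byte[] data) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            // With REPORT, invalid input throws instead of yielding U+FFFD.
            decoder.decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

If this returns false for the response body, the data is either binary or text in some other charset, and reading it through `InputStreamReader(is, "UTF-8")` will be lossy.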
