I receive some data from a server and read it in Java code:

is = new BufferedInputStream(connection.getInputStream());
reader = new BufferedReader(new InputStreamReader(is, "UTF-8"));

int length;
char[] buffer = new char[4096];
StringBuilder sb = new StringBuilder();
while ((length = reader.read(buffer)) != -1) {
   sb.append(new String(buffer, 0, length));//buffer is already incorrect
}

byte[] byteDatas = sb.toString().getBytes();

Then I print byteDatas as a hex string:

[image: hex dump of byteDatas]

Compared with the Wireshark capture:

[image: Wireshark capture of the response]

Some bytes come out as ef bf bd. I know that is the UTF-8 encoding of \ufffd (65533), the replacement character that stands for invalid data.

So I think there must be a decoding error in my code. After debugging, I found that if I read the data directly from connection.getInputStream(), there is no invalid data.

So the problem must be in BufferedReader or InputStreamReader. But I have already specified "UTF-8", and the data in Wireshark does not look very weird. Is UTF-8 the wrong charset? The server does not specify a charset in its response.

How can I make BufferedReader read the data correctly?

UPDATE

My default charset is "UTF-8"; I confirmed this in the debugger. The data is already wrong by the time read returns, so it's not String's fault.

zzy
  • You need to specify the charset when you read text input...that is the problem one way or the other. – Jared Jan 07 '15 at 10:03
  • I'm thinking that you're *not* reading character streams but byte content hence the issue with the data. – Buhake Sindi Jan 07 '15 at 10:05
  • @BuhakeSindi It is byte content, but I think this approach can also read byte content. – zzy Jan 07 '15 at 10:21
  • @zzy not true. Read this [SO answer](http://stackoverflow.com/questions/5764065/the-difference-between-inputstream-and-inputstreamreader-when-reading-multi-byte#answer-5764357) to best understand what I mean. Rather read the `byte` directly and hex encode it instead of using character arrays. – Buhake Sindi Jan 07 '15 at 10:28
  • @BuhakeSindi I got it. I don't know in advance whether the server will reply with `character` data or raw `byte` data. Should I always read bytes rather than chars so the data is read correctly, and let the upper layer decide whether to convert it to `character` data? – zzy Jan 07 '15 at 11:11
  • Check the `Content-Type` header. This will tell you if the content is character stream or byte stream and what `charset` the data is encoded in. – Buhake Sindi Jan 07 '15 at 12:01
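
The advice in the comments above (read the bytes directly and hex-encode them) could look roughly like the sketch below. The class name `RawBytesDemo`, its helper methods, and the sample payload are hypothetical; a `ByteArrayInputStream` stands in for `connection.getInputStream()`.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

// Hypothetical helper class, not part of the original question.
class RawBytesDemo {

    // Copies the stream into a byte array with no charset decoding at all,
    // so every byte arrives exactly as it was sent.
    static byte[] readAllBytes(InputStream in) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        int length;
        try {
            while ((length = in.read(buffer)) != -1) {
                out.write(buffer, 0, length);
            }
        } catch (IOException e) {
            // Rethrown unchecked only to keep the sketch short.
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    // Formats bytes as space-separated lowercase hex pairs.
    static String toHex(byte[] data) {
        StringBuilder sb = new StringBuilder();
        for (byte b : data) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Made-up binary payload standing in for connection.getInputStream().
        byte[] payload = {(byte) 0x89, 0x50, (byte) 0xff, 0x00, 0x41};
        InputStream in = new ByteArrayInputStream(payload);
        System.out.println(toHex(readAllBytes(in))); // 89 50 ff 00 41
    }
}
```

With a real connection, `connection.getContentType()` returns the `Content-Type` header value (or null if the server sent none), which the caller can inspect before deciding whether to decode the bytes as text.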

1 Answer

String.getBytes() will use the platform's default encoding (not necessarily UTF-8) to convert the characters of the String to bytes.

Quoting from the javadoc of String.getBytes():

Encodes this String into a sequence of bytes using the platform's default charset...

You can't compare the UTF-8 encoded input data with a result that may have been produced by a different encoding. Instead, specify the encoding explicitly:

byte[] byteDatas = sb.toString().getBytes(StandardCharsets.UTF_8);
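
To illustrate why the charset argument matters, here is a small sketch that encodes the same one-character string with two explicit charsets. The class name `ExplicitCharsetDemo` and the sample text are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original answer.
class ExplicitCharsetDemo {

    // Formats bytes as space-separated lowercase hex pairs.
    static String toHex(byte[] data) {
        StringBuilder sb = new StringBuilder();
        for (byte b : data) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "\u00e9"; // LATIN SMALL LETTER E WITH ACUTE

        // The same character becomes different bytes under different charsets.
        System.out.println(toHex(text.getBytes(StandardCharsets.UTF_8)));      // c3 a9
        System.out.println(toHex(text.getBytes(StandardCharsets.ISO_8859_1))); // e9
    }
}
```

The no-argument `getBytes()` picks whichever charset the platform defaults to, so its output can match either line (or neither) depending on the machine.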

Note:

If your input data is NOT UTF-8 encoded text and you attempt to decode it as UTF-8, the decoder may replace invalid byte sequences. As a result, the bytes you get by encoding the String will not be the same as the raw input bytes.
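
The effect described in the note can be reproduced with a minimal sketch. The class `ReplacementDemo`, its methods, and the three-byte sample are hypothetical. The byte 0x89 is a lone UTF-8 continuation byte, so it is not valid UTF-8 on its own.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original answer.
class ReplacementDemo {

    // Decodes as UTF-8 and encodes again; malformed input becomes U+FFFD.
    static byte[] roundTrip(byte[] raw) {
        String decoded = new String(raw, StandardCharsets.UTF_8);
        return decoded.getBytes(StandardCharsets.UTF_8);
    }

    // Strict check: reports malformed input instead of silently replacing it.
    static boolean isValidUtf8(byte[] raw) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(raw));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // Formats bytes as space-separated lowercase hex pairs.
    static String toHex(byte[] data) {
        StringBuilder sb = new StringBuilder();
        for (byte b : data) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] raw = {0x41, (byte) 0x89, 0x42}; // 'A', invalid byte, 'B'
        System.out.println(toHex(raw));            // 41 89 42
        System.out.println(toHex(roundTrip(raw))); // 41 ef bf bd 42
        System.out.println(isValidUtf8(raw));      // false
    }
}
```

A strict decoder like the one in `isValidUtf8` is one way to detect that a payload is not UTF-8 text before any replacement happens.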

icza
  • The buffer is already wrong when I get it from `read`. It's not String's fault. – zzy Jan 07 '15 at 10:20
  • Then it may very well be that your input is NOT UTF-8 encoded text; if you attempt to decode it as UTF-8, the decoder may replace invalid byte sequences. – icza Jan 07 '15 at 10:21