
I have a byte array, byteObj, that was produced by BSON serialization.

    String strObj = new String(byteObj);
    System.out.println(byteObj.length);
    System.out.println(strObj.getBytes().length);

The result is 152 and 154, and the two byte arrays are not the same. How can I recover the original BSON byte array from the string?

update:

152 154
[-104, 0, 0, 0, 4, 116, 105, 116, 108, 101, 0, 80, 0, 0, 0, 2, 48, 0, 5, 0, 0, 0, 116, 104, 105, 115, 0, 2, 49, 0, 3, 0, 0, 0, 105, 115, 0, 2, 50, 0, 2, 0, 0, 0, 97, 0, 2, 51, 0, 5, 0, 0, 0, 116, 104, 105, 115, 0, 2, 52, 0, 2, 0, 0, 0, 97, 0, 2, 53, 0, 3, 0, 0, 0, 105, 115, 0, 2, 54, 0, 6, 0, 0, 0, 116, 105, 116, 108, 101, 0, 0, 4, 99, 111, 110, 116, 101, 110, 116, 0, 51, 0, 0, 0, 2, 48, 0, 5, 0, 0, 0, 116, 104, 105, 115, 0, 2, 49, 0, 2, 0, 0, 0, 97, 0, 2, 50, 0, 8, 0, 0, 0, 99, 111, 110, 116, 101, 110, 116, 0, 2, 51, 0, 3, 0, 0, 0, 105, 115, 0, 0, 0]
[-17, -65, -67, 0, 0, 0, 4, 116, 105, 116, 108, 101, 0, 80, 0, 0, 0, 2, 48, 0, 5, 0, 0, 0, 116, 104, 105, 115, 0, 2, 49, 0, 3, 0, 0, 0, 105, 115, 0, 2, 50, 0, 2, 0, 0, 0, 97, 0, 2, 51, 0, 5, 0, 0, 0, 116, 104, 105, 115, 0, 2, 52, 0, 2, 0, 0, 0, 97, 0, 2, 53, 0, 3, 0, 0, 0, 105, 115, 0, 2, 54, 0, 6, 0, 0, 0, 116, 105, 116, 108, 101, 0, 0, 4, 99, 111, 110, 116, 101, 110, 116, 0, 51, 0, 0, 0, 2, 48, 0, 5, 0, 0, 0, 116, 104, 105, 115, 0, 2, 49, 0, 2, 0, 0, 0, 97, 0, 2, 50, 0, 8, 0, 0, 0, 99, 111, 110, 116, 101, 110, 116, 0, 2, 51, 0, 3, 0, 0, 0, 105, 115, 0, 0, 0]

The first array is the original BSON byte array; the second is the result of new String(...).getBytes().

update 2: the test code

    BSONObject ob = new BasicBSONObject()
            .append("title", Arrays.asList(new String[]{"this", "is", "a", "this", "a", "is", "title"}))
            .append("content", Arrays.asList(new String[]{"this", "a", "content", "is"}));


    byte[] ahaha = BSON.encode(ob);
    BSON.decode(ahaha);

    // BSON.decode(new String(ahaha).getBytes());

    byte[] strByte = new String(ahaha).getBytes();

    System.out.println(ahaha.length + "\t" + strByte.length);
    System.out.println(Arrays.toString(ahaha));
    System.out.println(Arrays.toString(strByte));

See How do you convert binary data to Strings and back in Java? for a way to convert binary data to a string and back.
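One common way to carry binary data through a String is Base64. The following is a minimal sketch, assuming Java 8 or later for `java.util.Base64`; the class name `Base64RoundTrip` and its helper methods are hypothetical, and the sample bytes are only the first few bytes of the array printed above.

```java
import java.util.Arrays;
import java.util.Base64;

class Base64RoundTrip {
    // Encode arbitrary bytes as an ASCII-only string (no charset decoding involved).
    static String toText(byte[] data) {
        return Base64.getEncoder().encodeToString(data);
    }

    // Recover the exact original bytes from the Base64 string.
    static byte[] fromText(String text) {
        return Base64.getDecoder().decode(text);
    }

    public static void main(String[] args) {
        // Prefix of the BSON array from the question; -104 (0x98) is the byte
        // that does not survive new String(...).getBytes() under UTF-8.
        byte[] bson = {-104, 0, 0, 0, 4, 116, 105, 116, 108, 101, 0};

        String text = toText(bson);
        byte[] back = fromText(text);

        System.out.println(text);
        System.out.println(bson.length + "\t" + back.length);
        System.out.println(Arrays.equals(bson, back)); // true
    }
}
```

The string is about a third longer than the raw bytes, but it decodes to exactly the bytes that went in, regardless of the platform's default charset.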

Tilney
  • your original byteObj array should be of length 154... – assylias May 27 '15 at 08:36
  • Are you sure that your byte array is text? I mean, maybe the byte array is an image and the bytes cannot be converted to valid characters in a String; that's why you see a difference in length. – romfret May 27 '15 at 08:37
  • Any reason for this downvote? Do you even understand the question? – Tilney May 27 '15 at 08:37
  • No. If the byte array is small, say under 130 bytes, the two are the same. Above some threshold, it fails. – Tilney May 27 '15 at 08:38
  • Could you somehow post the byte array? Just a guess: maybe this is an issue with the byte order mark. I think this is `U+FEFF` at the beginning. – chris May 27 '15 at 08:39

2 Answers


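The difference comes from converting the bytes to a String. Note that the first byte is negative (-104, i.e. 0x98). In the output above it comes back as -17, -65, -67, which is the UTF-8 encoding of the replacement character U+FFFD, so one input byte became three. Here is the explanation from the Javadoc: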

The length of the new String is a function of the charset, and hence may not be equal to the length of the byte array. The behavior of this constructor when the given bytes are not valid in the default charset is unspecified.

The CharsetDecoder class should be used when more control over the decoding process is required.
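As an illustration of that last Javadoc sentence, here is a minimal sketch of a strict decoder that reports malformed input instead of silently replacing it. The class name `StrictDecode` and the method `isValidUtf8` are hypothetical, and UTF-8 is assumed as the charset in question.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class StrictDecode {
    // Returns true only if the bytes are well-formed UTF-8 text.
    static boolean isValidUtf8(byte[] data) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            // Thrown instead of substituting U+FFFD for the bad bytes.
            return false;
        }
    }

    public static void main(String[] args) {
        // Prefix of the BSON array from the question.
        byte[] bsonPrefix = {-104, 0, 0, 0, 4, 116, 105, 116, 108, 101, 0};
        byte[] plainText = "title".getBytes(StandardCharsets.UTF_8);

        System.out.println(isValidUtf8(bsonPrefix)); // false
        System.out.println(isValidUtf8(plainText));  // true
    }
}
```

A check like this only tells you that the bytes are not text in that charset; it does not make the String round trip lossless.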

Würgspaß
  • The length of the String is not the problem. He prints out `strObj.getBytes().length`, which is the number of bytes, not the number of characters. He expects (and so do I) that the result should equal the length of the byte array used to construct the `String`. If no charset is specified, the default character set is used for both conversions. – chris May 27 '15 at 08:55
  • I would _not_ expect anything, if I read something like this in the documentation: The behavior of this method when this string cannot be encoded in the default charset is unspecified. – Würgspaß May 27 '15 at 09:00
  • I don't think there is any connection between the serialization format and the charset, so `CharsetDecoder` cannot ensure the consistency of the byte array either. – Tilney May 27 '15 at 09:09
  • That's the issue. The output of `BSON.encode()` is binary data, not text encoded in your default character set, which is what `new String(byte[] b)` expects as input. These two encoding and decoding methods don't go together, and you're getting an encoding error, which is what that Javadoc is warning you about. If you want to decode a BSON-encoded byte sequence, you need to use something like `BSON.decode`. – Andrew Janke May 27 '15 at 09:11
  • Conversion from _MongoDB/BSON/String_ is an entirely different question. You may end up using a third party solution or code a solution of your own. But clearly, `String(byte[])` and `String.getBytes()` will not work. They are simply not intended for problems like this. – Würgspaß May 27 '15 at 09:16

I cannot reproduce the problem. The following code returns the same length (152), and the bytes are identical:

byte[] bs = {-104, 0, 0, 0, 4, 116, 105, 116, 108, 101, 0, 80, 0, 0, 0, 2, 48, 0, 5, 0, 0, 0, 116, 104, 105, 115, 0, 2, 49, 0, 3, 0, 0, 0, 105, 115, 0, 2, 50, 0, 2, 0, 0, 0, 97, 0, 2, 51, 0, 5, 0, 0, 0, 116, 104, 105, 115, 0, 2, 52, 0, 2, 0, 0, 0, 97, 0, 2, 53, 0, 3, 0, 0, 0, 105, 115, 0, 2, 54, 0, 6, 0, 0, 0, 116, 105, 116, 108, 101, 0, 0, 4, 99, 111, 110, 116, 101, 110, 116, 0, 51, 0, 0, 0, 2, 48, 0, 5, 0, 0, 0, 116, 104, 105, 115, 0, 2, 49, 0, 2, 0, 0, 0, 97, 0, 2, 50, 0, 8, 0, 0, 0, 99, 111, 110, 116, 101, 110, 116, 0, 2, 51, 0, 3, 0, 0, 0, 105, 115, 0, 0, 0};

System.out.println(new String(bs).getBytes().length);
System.out.println(bs.length);
romfret
  • I updated the code. You may need the MongoDB Java driver to get it to work. – Tilney May 27 '15 at 09:01
  • I just copy/pasted your new code. The results are correct for me! `152 152` `[-104, 0, ...]` `[-104, 0, ...]` I used BSON version 2.3 – romfret May 27 '15 at 09:08
  • Could you please tell me your workspace environment? I'm on Ubuntu 12 with MongoDB driver 3.0.1. – Tilney May 27 '15 at 09:12
  • Also check your locales, so you know what encoding `String(byte[])` is expecting. – Andrew Janke May 27 '15 at 09:13
  • My Locale is "FR_fr" but it works perfectly with "en" too. I am not using MongoDB drivers but only `org.mongodb.bson` version 2.3 as a Maven dependency. – romfret May 27 '15 at 09:22
  • @romfret: What is your default character set? You can get it with this code: `System.out.println(Charset.defaultCharset().displayName());`. I got different sizes with `UTF-8` but the same count of bytes when using `windows-1252`. – chris May 27 '15 at 09:27
  • That's it. Mine is `windows-1252`. You spotted the problem ;) – romfret May 27 '15 at 09:32
  • http://stackoverflow.com/questions/20778/how-do-you-convert-binary-data-to-strings-and-back-in-java This could be the final solution. And `windows-1252` doesn't work for me. – Tilney May 27 '15 at 09:35
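The charset dependence discussed in these comments can be reproduced without relying on the platform default by naming the charset explicitly. The sketch below is an illustration only: the class name `CharsetRoundTrip` and the method `roundTrip` are hypothetical, and ISO-8859-1 is used as an example of a single-byte charset that maps every byte value to a character (it is not one of the charsets tested in the thread).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class CharsetRoundTrip {
    // Decode the bytes to a String and encode them back, both with the same charset.
    static byte[] roundTrip(byte[] data, Charset cs) {
        return new String(data, cs).getBytes(cs);
    }

    public static void main(String[] args) {
        // First five bytes of the BSON array from the question.
        byte[] bsonPrefix = {-104, 0, 0, 0, 4};

        // What new String(byte[]) and getBytes() use when no charset is given.
        System.out.println(Charset.defaultCharset().displayName());

        // UTF-8: 0x98 is malformed on its own and becomes U+FFFD (three bytes).
        System.out.println(Arrays.toString(roundTrip(bsonPrefix, StandardCharsets.UTF_8)));
        // [-17, -65, -67, 0, 0, 0, 4]

        // ISO-8859-1: every byte maps to exactly one char and back.
        System.out.println(Arrays.toString(roundTrip(bsonPrefix, StandardCharsets.ISO_8859_1)));
        // [-104, 0, 0, 0, 4]
    }
}
```

Even where a single-byte charset happens to round-trip, the resulting String is not meaningful text, so an encoding designed for binary data is usually the safer choice.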