
I am trying to understand BSON via http://bsonspec.org/#/specification, but I still have some questions.


Let's take an example from the website above:

{"hello": "world"} → "\x16\x00\x00\x00\x02hello\x00\x06\x00\x00\x00world\x00\x00"

Question 1

In the example above, the double quotes around the encoded bytes are not actually part of the result, right?

Question 2

I understand that the first 4 bytes, \x16\x00\x00\x00, are the size of the whole BSON document.

And the size is in little-endian format. But why? Why not use big endian?

Question 3

How does the size of the example document come to \x16, i.e. 22?

Question 4

Normally, if I want to encode the document myself, how do I calculate its size? I think my main trouble is how to determine the size of a UTF-8 string.


Let's take another example:

{"BSON": ["awesome", 5.05, 1986]}   

→   

"\x31\x00\x00\x00\x04BSON\x00\x26\x00\x00\x00\x020\x00\x08\x00\x00 
 \x00awesome\x00\x011\x00\x33\x33\x33\x33\x33\x33\x14\x40\x102\x00\xc2\x07\x00\x00 
 \x00\x00"

Question 5

This example contains an array. According to the specification, an array is actually a list of {key, value} pairs in which the keys are 0, 1, etc. So are the 0, 1 here strings too?

Jackson Tale

1 Answer


Question 1

In the example above, the double quotes around the encoded bytes are not actually part of the result, right?

The quotes are not part of the encoded bytes. They only delimit the string literal, just as quotes mark strings in JSON.

Question 2

And the size is in little-endian format. But why? Why not use big endian?

The choice of endianness is largely a matter of preference. One advantage of little endian is that most commonly used platforms (x86, for example) are little endian, so they don't need to reverse the bytes.
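To see the difference concretely, here is a small sketch using Python's standard `struct` module (my choice of language for illustration; nothing in the spec requires it). It packs the value 22 as a signed 32-bit integer in both byte orders:

```python
import struct

# "<i" = little-endian signed 32-bit integer (what BSON uses for int32)
# ">i" = big-endian signed 32-bit integer (shown only for comparison)
little = struct.pack("<i", 22)
big = struct.pack(">i", 22)

print(little)  # b'\x16\x00\x00\x00'
print(big)     # b'\x00\x00\x00\x16'

# Reading it back must use the same byte order it was written with.
(value,) = struct.unpack("<i", little)
print(value)   # 22
```

The little-endian result matches the first four bytes of the example document.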

Question 3

How does the size of the example document come to \x16, i.e. 22?

There are 22 bytes in total, including the 4-byte length prefix itself.
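If you want to check the count yourself, the following sketch breaks the example document into its parts and compares the sum against the length prefix. The labels in `parts` are my own descriptions, not names from the spec:

```python
import struct

# The encoded form of {"hello": "world"} from the spec's example.
doc = b"\x16\x00\x00\x00\x02hello\x00\x06\x00\x00\x00world\x00\x00"

# The first four bytes hold the total size as a little-endian int32.
(declared,) = struct.unpack_from("<i", doc, 0)

# Byte-by-byte breakdown of the document.
parts = [
    ("document length (int32)", 4),
    ("element type 0x02 (string)", 1),
    ("key 'hello' plus NUL terminator", 6),
    ("string length (int32)", 4),
    ("value 'world' plus NUL terminator", 6),
    ("document terminator 0x00", 1),
]
counted = sum(size for _, size in parts)

print(declared, counted, len(doc))  # 22 22 22
```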

Question 4

Normally, if I want to encode the document myself, how do I calculate its size? I think my main trouble is how to determine the size of a UTF-8 string.

First write out the document, and then go back to fill in the length.
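A minimal sketch of that approach in Python, assuming a document that contains only string values (type 0x02). The function name `encode_document` is hypothetical, and a real encoder would have to handle the other element types as well:

```python
import struct

def encode_document(fields):
    """Encode a dict of str -> str as BSON (string elements only)."""
    buf = bytearray(4)                       # placeholder for the total length
    for key, value in fields.items():
        data = value.encode("utf-8") + b"\x00"
        buf += b"\x02"                       # element type: UTF-8 string
        buf += key.encode("utf-8") + b"\x00" # key as a cstring
        buf += struct.pack("<i", len(data))  # byte length incl. the NUL
        buf += data
    buf += b"\x00"                           # document terminator
    struct.pack_into("<i", buf, 0, len(buf)) # go back and fill in the length
    return bytes(buf)

encoded = encode_document({"hello": "world"})
print(encoded == b"\x16\x00\x00\x00\x02hello\x00\x06\x00\x00\x00world\x00\x00")  # True
```

Note that the string length is taken after encoding to UTF-8, so it counts bytes rather than characters.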

Question 5

This example contains an array. According to the specification, an array is actually a list of {key, value} pairs in which the keys are 0, 1, etc. So are the 0, 1 here strings too?

Yes. To be exact, they are zero-terminated strings without a length prefix (called cstring in the specification). An array is encoded just like an embedded document whose keys are "0", "1", and so on.
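To make the array keys visible, here is a rough decoder sketch in Python that walks the second example. It handles only the four element types that appear there (0x01 double, 0x02 string, 0x04 array, 0x10 int32); the helper names `read_cstring` and `read_elements` are hypothetical:

```python
import struct

# The encoded form of {"BSON": ["awesome", 5.05, 1986]} from the spec's example.
data = (
    b"\x31\x00\x00\x00\x04BSON\x00\x26\x00\x00\x00"
    b"\x020\x00\x08\x00\x00\x00awesome\x00"
    b"\x011\x00\x33\x33\x33\x33\x33\x33\x14\x40"
    b"\x102\x00\xc2\x07\x00\x00"
    b"\x00\x00"
)

def read_cstring(buf, pos):
    """Read a zero-terminated string; return it and the next position."""
    end = buf.index(b"\x00", pos)
    return buf[pos:end].decode("utf-8"), end + 1

def read_elements(buf, pos):
    """Read a document or array as a list of (key, value) pairs."""
    (size,) = struct.unpack_from("<i", buf, pos)
    end = pos + size - 1                  # index of the terminating 0x00
    pos += 4
    items = []
    while pos < end:
        kind = buf[pos]
        key, pos = read_cstring(buf, pos + 1)
        if kind == 0x01:                  # 64-bit float
            (value,) = struct.unpack_from("<d", buf, pos)
            pos += 8
        elif kind == 0x02:                # length-prefixed UTF-8 string
            (n,) = struct.unpack_from("<i", buf, pos)
            value = buf[pos + 4 : pos + 4 + n - 1].decode("utf-8")
            pos += 4 + n
        elif kind == 0x04:                # array: same layout as a document
            value, pos = read_elements(buf, pos)
        elif kind == 0x10:                # 32-bit integer
            (value,) = struct.unpack_from("<i", buf, pos)
            pos += 4
        else:
            raise ValueError(f"unsupported element type 0x{kind:02x}")
        items.append((key, value))
    return items, end + 1

top, _ = read_elements(data, 0)
name, array = top[0]
print(name, [key for key, _ in array])    # BSON ['0', '1', '2']
```

The array's keys come back as the strings "0", "1" and "2", read with the same cstring routine used for the document key "BSON".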

CodesInChaos
  • For question 3, could you please count the bytes for me? How many bytes does a UTF-8 character occupy? – Jackson Tale Apr 23 '13 at 13:14
  • @JacksonTale A UTF-8 code unit needs one byte. In your case there are 10 bytes for the strings themselves, 3 null terminators, 1 type marker and 2*4 bytes of length specifiers, for a total of 22. Just look at the example string. – CodesInChaos Apr 23 '13 at 13:53
  • Hi, according to http://stackoverflow.com/questions/5290182/how-many-bytes-takes-one-unicode-character and http://stackoverflow.com/questions/10229156/how-many-characters-are-there-in-utf-8, doesn't a UTF-8 code unit need more than one byte (1-4)? – Jackson Tale Apr 23 '13 at 14:51
  • When encoding a Unicode *code point* as UTF-8 you get 1-4 *code units*. A UTF-8 code unit is 8 bits by definition. A code point in the ASCII range (0-127) produces a single UTF-8 code unit; higher code points need more. For extra fun, you sometimes need multiple code points to form a single rendered symbol, plus lots of other Unicode complexity. – CodesInChaos Apr 23 '13 at 14:56
  • So in this case, how can I calculate the length of the string if it is not ASCII? – Jackson Tale Apr 23 '13 at 15:06
  • @JacksonTale It depends on your programming language. Different languages represent strings differently. Some use UTF-8, some UTF-16, some don't specify it at all... Many have built-in library functions for this. Otherwise you need to iterate through the string manually, computing the UTF-8 size of each code point. I know how to handle it in .NET, but you'll need to figure out how your own language works. – CodesInChaos Apr 23 '13 at 15:08
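
As one illustration of the two options mentioned in the comments, here is a sketch in Python, where a `str` is a sequence of code points. The function names `utf8_len` and `utf8_len_manual` are hypothetical, and the manual version assumes the string contains no lone surrogates:

```python
def utf8_len(text):
    """Byte length via the built-in encoder."""
    return len(text.encode("utf-8"))

def utf8_len_manual(text):
    """Byte length computed per code point, without encoding."""
    total = 0
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:         # ASCII range: 1 code unit
            total += 1
        elif cp < 0x800:      # up to U+07FF: 2 code units
            total += 2
        elif cp < 0x10000:    # rest of the Basic Multilingual Plane: 3
            total += 3
        else:                 # supplementary planes: 4
            total += 4
    return total

for sample in ["world", "h\u00e9llo", "\u20ac", "\U0001F600"]:
    print(len(sample), utf8_len(sample), utf8_len_manual(sample))
# prints: 5 5 5 / 5 6 6 / 1 3 3 / 1 4 4 (one triple per line)
```

For a BSON string value, the length prefix is this byte count plus one for the trailing NUL.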