
I'm working with images in numpy array form. I need to serialize/deserialize them to/from JSON (I'm using MongoDB)

I know numpy arrays cannot be serialized with json.dump, but I wonder if there is a better way, because converting a byte numpy array to BSON multiplies the number of bytes by almost 12 (I don't understand why):

import numpy as np
import bson
RC = 500
npdata = np.zeros(shape=(RC,RC,3), dtype='B')
rows, cols, depth = npdata.shape
npsize = rows*cols*depth
npdata=npdata.reshape((npsize,))
listdata = npdata.tolist()
bsondata = bson.BSON.encode({"rows": rows, "cols": cols, "data": listdata})
lb = len(bsondata)
print(lb, npsize, lb/npsize) 

> 8888926 750000 11.851901333333334 
    The question states that you need to serialize to/from JSON, but then you use BSON. Which one do you really need? Or are you looking for *any* way to efficiently serialize arrays? Please clarify. – MB-F Sep 28 '17 at 07:27
  • You communicate with MongoDB in JSON, but MongoDB uses BSON behind the scenes. What I'm looking for is a reasonable BSON size. – Eduardo Sep 28 '17 at 08:30

1 Answer


The reason for this increased number of bytes is how BSON saves the data. You can find this information in the BSON specification, but let's look at a concrete example:

import numpy as np
import bson

npdata = np.arange(10, dtype='B') * 11
listdata = npdata.tolist()
bsondata = bson.BSON.encode({"data": listdata})

print([hex(b) for b in bsondata])

Here, we store an array with the values [0, 11, 22, ..., 99] as BSON and print the resulting binary data. Below I have annotated the result to explain what's going on:

['0x56', '0x0', '0x0', '0x0',  # total number of bytes in the document
 # First element in document
     '0x4',  # Array
     '0x64', '0x61', '0x74', '0x61', '0x0',  # key: "data"
     # subdocument (data array)
         '0x4b',  '0x0', '0x0', '0x0',  # total number of bytes
         # first element in data array
             '0x10',                        # 32 bit integer
             '0x30', '0x0',                 # key: "0"
             '0x0', '0x0', '0x0', '0x0',    # value: 0
         # second element in data array
             '0x10',                        # 32 bit integer
             '0x31', '0x0',                 # key: "1"
             '0xb', '0x0', '0x0', '0x0',    # value: 11
         # third element in data array
             '0x10',                        # 32 bit integer
             '0x32', '0x0',                 # key: "2"
             '0x16', '0x0', '0x0', '0x0',   # value: 22             
 # ...
]

In addition to some format overhead, each value of the array is rather wastefully encoded with 7 bytes: 1 byte to specify the data type, 2 bytes for a string containing the index (three bytes for indices >=10, four bytes for indices >=100, ...) and 4 bytes for the 32 bit integer value.

This at least explains why the BSON data is so much bigger than the original array.
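As a sanity check (a back-of-the-envelope calculation based on the element layout described above, using the 8888926 figure from the question's output), the per-element cost reproduces the question's BSON size exactly:

```python
# Predict the BSON size of {"rows": 500, "cols": 500, "data": [0]*750000}
# from the layout described above.
n = 500 * 500 * 3  # 750000 array elements

# each array element: 1 type byte + key cstring (index digits + NUL) + 4 value bytes
elements = sum(1 + (len(str(i)) + 1) + 4 for i in range(n))

data_subdoc = 4 + elements + 1                 # int32 length + elements + terminator
data_elem = 1 + len("data") + 1 + data_subdoc  # 0x04 type byte + "data\0" + subdocument
int_elem = 1 + len("rows") + 1 + 4             # "rows" and "cols" are int32 elements
total = 4 + data_elem + 2 * int_elem + 1       # doc length + elements + terminator

print(total)  # 8888926 -- matches the size printed in the question
```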

I found two libraries, mongodb/bson-numpy and ajdavis/bson-numpy on GitHub, which may do a better job of encoding numpy arrays in BSON. However, I did not try them, so I can't say if that is the case or if they even work correctly.
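As an aside (not part of the libraries above): the BSON specification also defines a binary element type (0x05) that stores raw bytes with constant overhead. A minimal hand-rolled sketch using only the standard library shows how compact that would be for the question's 750000-byte image; `bson_binary_doc` is a hypothetical helper written for illustration, not a pymongo API:

```python
import struct

def bson_binary_doc(key: bytes, payload: bytes) -> bytes:
    """Build a BSON document {key: <binary>} per the BSON spec:
    element = 0x05 (binary) + key cstring + int32 length + subtype 0x00 + bytes."""
    element = b"\x05" + key + b"\x00" + struct.pack("<i", len(payload)) + b"\x00" + payload
    body = element + b"\x00"  # document terminator
    return struct.pack("<i", 4 + len(body)) + body  # int32 total length + body

payload = bytes(750000)  # the same flattened 500*500*3 zero image from the question
doc = bson_binary_doc(b"data", payload)
print(len(doc), len(doc) / len(payload))  # 750016 -> only 16 bytes of overhead
```

So encoding the array as raw bytes instead of a BSON array would bring the ratio from ~11.85 down to essentially 1.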

MB-F
  • "2 bytes for a string containing the index (three bytes for indices >=10, four bytes for indices >=100, ...)" Thanks! They need to add this to the spec. – dustinevan Oct 31 '20 at 05:09