I ran benchmarks saving a ragged nested list with JSON, BSON, NumPy, and HDF5.
TL;DR: use compressed JSON. It is the most space-efficient of the four and the easiest to encode and decode.
On the synthetic data, here are the results (file sizes from du -sh test*):
4.5M test.json.gz
7.5M test.bson.gz
8.5M test.npz
261M test_notcompressed.h5
1.3G test_compressed.h5
Compressed JSON is the most storage-efficient, and it is also the easiest to encode and decode because the ragged list does not have to be converted to a mapping first. BSON comes in second, but it does have to be converted to a mapping, which complicates encoding and decoding (and negates the speed advantage BSON normally has over JSON). NumPy's compressed NPZ format is third best, and like BSON it requires turning the ragged list into a dictionary before saving. HDF5 is surprisingly large, especially when compressed. This is probably because the ragged list maps to a huge number of small datasets, and compression adds per-dataset overhead.
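For reference, decoding the gzipped JSON back into the ragged list is trivial. Here is a minimal sketch (load_json_gz is just an illustrative name, the counterpart of save_json_gz below):

def load_json_gz(filepath):
    """Load a ragged nested list from gzipped JSON."""
    import gzip
    import json
    # "rt" wraps the gzip stream in text mode, which json.load expects.
    with gzip.open(filepath, mode="rt") as f:
        return json.load(f)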
Benchmarks
Here is the relevant code for the benchmarking. The bson package is part of pymongo. I ran these benchmarks on a Debian Buster machine with an ext4 filesystem.
def get_ragged_list(length=100000):
    """Return a ragged nested list of random integers."""
    import random
    random.seed(42)
    l = []
    for _ in range(length):
        n_sublists = random.randint(1, 9)
        sublist = []
        for _ in range(n_sublists):
            subsublist = [random.randint(0, 1000) for _ in range(random.randint(1, 9))]
            sublist.append(subsublist)
        l.append(sublist)
    return l
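As a quick sanity check (illustrative only), the structure is a list of lists of lists of ints:

ragged = get_ragged_list(length=3)
# Each top-level element holds 1 to 9 sublists of 1 to 9 ints each.
print(len(ragged), [len(sublist) for sublist in ragged[0]])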
def save_json_gz(obj, filepath):
    import gzip
    import json
    json_str = json.dumps(obj)
    json_bytes = json_str.encode()
    with gzip.GzipFile(filepath, mode="w") as f:
        f.write(json_bytes)
def save_bson(obj, filepath):
    import gzip
    import bson
    # BSON encodes a document (mapping), so flatten the ragged list
    # into a dict keyed by "i/j".
    d = {}
    for ii, n in enumerate(obj):
        for jj, nn in enumerate(n):
            key = f"{ii}/{jj}"
            d[key] = nn
    b = bson.BSON.encode(d)
    with gzip.GzipFile(filepath, mode="w") as f:
        f.write(b)
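Decoding shows the extra work: the flat mapping has to be rebuilt into a ragged list. A sketch of the reverse (load_bson is a hypothetical helper name, not part of pymongo):

def load_bson(filepath):
    import gzip
    import bson
    with gzip.GzipFile(filepath, mode="r") as f:
        d = bson.BSON(f.read()).decode()
    # Group values by their "i/j" keys, then sort to restore order.
    grouped = {}
    for key, value in d.items():
        ii, jj = map(int, key.split("/"))
        grouped.setdefault(ii, {})[jj] = value
    return [[grouped[ii][jj] for jj in sorted(grouped[ii])]
            for ii in sorted(grouped)]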
def save_numpy(obj, filepath):
    import numpy as np
    # savez_compressed stores named arrays, so flatten the ragged list
    # into a dict keyed by "i/j" and unpack it as keyword arguments.
    d = {}
    for ii, n in enumerate(obj):
        for jj, nn in enumerate(n):
            key = f"{ii}/{jj}"
            d[key] = nn
    np.savez_compressed(filepath, **d)
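Loading the NPZ needs the same kind of reassembly; something along these lines (load_numpy is again an illustrative name):

def load_numpy(filepath):
    import numpy as np
    grouped = {}
    with np.load(filepath) as npz:
        for key in npz.files:
            ii, jj = map(int, key.split("/"))
            grouped.setdefault(ii, {})[jj] = npz[key].tolist()
    return [[grouped[ii][jj] for jj in sorted(grouped[ii])]
            for ii in sorted(grouped)]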
def save_hdf5(obj, filepath, compression="lzf"):
    import h5py
    with h5py.File(filepath, mode="w") as f:
        # Each innermost list becomes its own (tiny) dataset at "i/j".
        for ii, n in enumerate(obj):
            for jj, nn in enumerate(n):
                name = f"{ii}/{jj}"
                f.create_dataset(name, data=nn, compression=compression)
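To see where the HDF5 overhead comes from, you can count the datasets after running the benchmark below; with these parameters there should be roughly half a million of them (a rough sketch):

import h5py

datasets = []
with h5py.File("test_compressed.h5", mode="r") as f:
    f.visititems(lambda name, obj: datasets.append(name)
                 if isinstance(obj, h5py.Dataset) else None)
print(f"{len(datasets)} datasets")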
ragged = get_ragged_list()
save_json_gz(ragged, "test.json.gz")
save_bson(ragged, "test.bson.gz")
save_numpy(ragged, "test.npz")
save_hdf5(ragged, "test_notcompressed.h5", compression=None)
save_hdf5(ragged, "test_compressed.h5", compression="lzf")
Versions of relevant packages:
python 3.8.2 | packaged by conda-forge | (default, Mar 23 2020, 18:16:37) [GCC 7.3.0]
pymongo (bson) 3.10.1
numpy 1.18.2
h5py 2.10.0