I ran benchmarks saving a ragged nested list with JSON, BSON, NumPy, and HDF5.
TL;DR: use compressed JSON. It is the most space-efficient of the four and the easiest to encode and decode.
On the synthetic data, here are the results (file sizes from du -sh test*):
4.5M test.json.gz
7.5M test.bson.gz
8.5M test.npz
261M test_notcompressed.h5
1.3G test_compressed.h5
Compressed JSON is the most storage-efficient, and it is also the easiest to encode and decode because the ragged list does not have to be converted to a mapping first. BSON comes in second, but it does have to be converted to a mapping, which complicates encoding and decoding (and negates the speed advantage BSON normally has over JSON). NumPy's compressed NPZ format is third best, and like BSON it requires turning the ragged list into a dictionary before saving. HDF5 is surprisingly large, especially when compressed. This is probably because the ragged list maps to a huge number of small datasets, and compression adds per-dataset overhead.
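For reference, decoding the gzipped JSON back into the ragged list is trivial. Here is a minimal sketch (load_json_gz is just an illustrative name, the counterpart of save_json_gz below):

def load_json_gz(filepath):
    """Load a ragged nested list from gzipped JSON."""
    import gzip
    import json
    # "rt" wraps the gzip stream in text mode, which json.load expects.
    with gzip.open(filepath, mode="rt") as f:
        return json.load(f)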
Benchmarks
Here is the relevant code for the benchmarking. The bson package is part of pymongo. I ran these benchmarks on a Debian Buster machine with an ext4 filesystem.
def get_ragged_list(length=100000):
    """Return a ragged nested list of random integers."""
    import random
    random.seed(42)
    l = []
    for _ in range(length):
        n_sublists = random.randint(1, 9)
        sublist = []
        for _ in range(n_sublists):
            subsublist = [random.randint(0, 1000) for _ in range(random.randint(1, 9))]
            sublist.append(subsublist)
        l.append(sublist)
    return l
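As a quick sanity check (illustrative only), the structure is a list of lists of lists of ints:

ragged = get_ragged_list(length=3)
# Each top-level element holds 1 to 9 sublists of 1 to 9 ints each.
print(len(ragged), [len(sublist) for sublist in ragged[0]])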
def save_json_gz(obj, filepath):
    import gzip
    import json
    json_str = json.dumps(obj)
    json_bytes = json_str.encode()
    with gzip.GzipFile(filepath, mode="w") as f:
        f.write(json_bytes)
def save_bson(obj, filepath):
    import gzip
    import bson
    # BSON encodes a document (mapping), so flatten the ragged list
    # into a dict keyed by "i/j".
    d = {}
    for ii, n in enumerate(obj):
        for jj, nn in enumerate(n):
            key = f"{ii}/{jj}"
            d[key] = nn
    b = bson.BSON.encode(d)
    with gzip.GzipFile(filepath, mode="w") as f:
        f.write(b)
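Decoding shows the extra work: the flat mapping has to be rebuilt into a ragged list. A sketch of the reverse (load_bson is a hypothetical helper name, not part of pymongo):

def load_bson(filepath):
    import gzip
    import bson
    with gzip.GzipFile(filepath, mode="r") as f:
        d = bson.BSON(f.read()).decode()
    # Group values by their "i/j" keys, then sort to restore order.
    grouped = {}
    for key, value in d.items():
        ii, jj = map(int, key.split("/"))
        grouped.setdefault(ii, {})[jj] = value
    return [[grouped[ii][jj] for jj in sorted(grouped[ii])]
            for ii in sorted(grouped)]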
def save_numpy(obj, filepath):
    import numpy as np
    # savez_compressed stores named arrays, so flatten the ragged list
    # into a dict keyed by "i/j" and unpack it as keyword arguments.
    d = {}
    for ii, n in enumerate(obj):
        for jj, nn in enumerate(n):
            key = f"{ii}/{jj}"
            d[key] = nn
    np.savez_compressed(filepath, **d)
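Loading the NPZ needs the same kind of reassembly; something along these lines (load_numpy is again an illustrative name):

def load_numpy(filepath):
    import numpy as np
    grouped = {}
    with np.load(filepath) as npz:
        for key in npz.files:
            ii, jj = map(int, key.split("/"))
            grouped.setdefault(ii, {})[jj] = npz[key].tolist()
    return [[grouped[ii][jj] for jj in sorted(grouped[ii])]
            for ii in sorted(grouped)]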
def save_hdf5(obj, filepath, compression="lzf"):
    import h5py
    with h5py.File(filepath, mode="w") as f:
        # Each innermost list becomes its own (tiny) dataset at "i/j".
        for ii, n in enumerate(obj):
            for jj, nn in enumerate(n):
                name = f"{ii}/{jj}"
                f.create_dataset(name, data=nn, compression=compression)
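To see where the HDF5 overhead comes from, you can count the datasets after running the benchmark below; with these parameters there should be roughly half a million of them (a rough sketch):

import h5py

datasets = []
with h5py.File("test_compressed.h5", mode="r") as f:
    f.visititems(lambda name, obj: datasets.append(name)
                 if isinstance(obj, h5py.Dataset) else None)
print(f"{len(datasets)} datasets")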
ragged = get_ragged_list()
save_json_gz(ragged, "test.json.gz")
save_bson(ragged, "test.bson.gz")
save_numpy(ragged, "test.npz")
save_hdf5(ragged, "test_notcompressed.h5", compression=None)
save_hdf5(ragged, "test_compressed.h5", compression="lzf")
Versions of relevant packages:
python 3.8.2 | packaged by conda-forge | (default, Mar 23 2020, 18:16:37) [GCC 7.3.0]
pymongo (bson) 3.10.1
numpy 1.18.2
h5py 2.10.0