When you say "space on disk is not an issue", how many TB do you actually have? You have 2.5×10^8 elements, each having ~5×10^2 ints in foo, each int taking (up to) 4 bytes, which gives a bare minimum of 5×10^11 bytes = 500 GB for the data alone.
I think if you have the storage space for it, SQLite would be a very good solution.
I created a database:
import sqlite3
create_table_element_stmt = """
CREATE TABLE element (
    id INTEGER PRIMARY KEY,
    x INTEGER,
    y INTEGER,
    bar INTEGER
)
"""
create_index_on_element_id_stmt = """
CREATE UNIQUE INDEX index_id ON element(id)
"""
create_table_foo_stmt = """
CREATE TABLE foo (
    element_id INTEGER,
    value INTEGER,
    PRIMARY KEY(element_id, value),
    FOREIGN KEY(element_id) REFERENCES element(id)
)
"""
create_index_on_foo_element_id_stmt = """
CREATE UNIQUE INDEX index_element_id ON foo(element_id, value)
"""
conn = sqlite3.connect("so70947211.db")
conn.execute(create_table_element_stmt)
conn.execute(create_table_foo_stmt)
conn.commit()
and populated it with dummy values:
import itertools
import time
from math import ceil, floor
import tqdm  # installed with pip
N_ELEMENTS = 250 * 10**6
# N_FOOS_BY_ELEMENT = 500
N_FOOS_BY_ELEMENT = 5
def yield_elements_no_foo_as_tuple():
    x, y, bar = ord('X'), ord('Y'), hash("bar")  # 88, 89, -3649913608566960221
    yield from ((id, x, y, bar)
                for id in range(N_ELEMENTS))
def yield_foos_as_tuple():
    # foo lengths vary around N_FOOS_BY_ELEMENT (from 80% to 125% of it)
    foos = [list(range(length + 1))
            for length in range(int(floor(N_FOOS_BY_ELEMENT * 0.8)),
                                int(ceil(N_FOOS_BY_ELEMENT * 1.25)))]
    yield from itertools.chain.from_iterable(
        zip(itertools.repeat(id), foo)
        for id, foo in zip(range(N_ELEMENTS), itertools.cycle(foos))
    )
insert_element_stmt = """
INSERT INTO element (id, x, y, bar) VALUES (?, ?, ?, ?)
"""
insert_foo_stmt = """
INSERT INTO foo (element_id, value) VALUES(?, ?)
"""
conn.executemany(insert_element_stmt, tqdm.tqdm(yield_elements_no_foo_as_tuple(), total=N_ELEMENTS))
conn.commit() # `executemany` does not commit on its own
time_before_index1 = time.time()
conn.execute(create_index_on_element_id_stmt) # create the index AFTER inserting the data
conn.commit()
print(time.time() - time_before_index1)
conn.executemany(insert_foo_stmt, tqdm.tqdm(yield_foos_as_tuple(), total=N_ELEMENTS*N_FOOS_BY_ELEMENT))
conn.commit() # `executemany` does not commit on its own
time_before_index2 = time.time()
conn.execute(create_index_on_foo_element_id_stmt) # create the index AFTER inserting the data
conn.commit()
print(time.time() - time_before_index2)
conn.close()
which took:
step | duration | resulting DB file size
--- | --- | ---
inserting elements | ~4 minutes | ~5.5 GB
indexing elements | ~1 minute | ~9 GB
inserting 1% of foos | ~33 minutes | ?
indexing foos | ~9 minutes | ~83 GB
For example purposes, I decided to insert constant values because it made the process much faster than calling random.randint millions of times. I still think it is representative of the actual result you would get, because SQLite does not compress data by default.
For your actual data, it may be hard to keep Python from being the bottleneck (it was for me in the very first version I wrote). Also, I inserted only 5 foos per element instead of ~500, which would have made the process roughly 100 times slower (around 48 hours). There are several questions on this site about bulk-insert performance; I recommend this one. Also, consider using another language to create the database; it will help greatly.
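As a rough illustration of the kind of tuning those bulk-insert questions discuss, here is a sketch; the specific PRAGMA values are my assumptions, they trade crash-safety for speed during the initial load, so benchmark them for your case:
import sqlite3
conn = sqlite3.connect("so70947211.db")
# relax durability guarantees during the initial bulk load only (assumption: the DB can be rebuilt if it crashes)
conn.execute("PRAGMA journal_mode = OFF")
conn.execute("PRAGMA synchronous = OFF")
conn.execute("PRAGMA cache_size = -1000000")  # negative value = size in KiB, so ~1 GB of page cache
with conn:  # group all the inserts into a single transaction
    conn.executemany(insert_element_stmt, yield_elements_no_foo_as_tuple())
conn.close()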
Here is the result:
print(list(conn.execute("SELECT COUNT(*) FROM element"))) # [(250000000,)]
print(list(conn.execute("SELECT * FROM element LIMIT 1"))) # [(0, 88, 89, -3649913608566960221)]
print(list(conn.execute("SELECT COUNT(*) FROM foo"))) # [(1250000000,)]
print(list(conn.execute("SELECT * FROM foo LIMIT 100"))) # [(0, 0), (1, 1), (2, 2), (3, 3), ...]
print(list(conn.execute("EXPLAIN QUERY PLAN SELECT * FROM foo JOIN element ON foo.element_id=element.id WHERE element_id=4137")))
# [(3, 0, 0, 'SEARCH TABLE element USING INTEGER PRIMARY KEY (rowid=?)'),
# (10, 0, 0, 'SEARCH TABLE foo USING COVERING INDEX index_element_id (element_id=?)')]
time_before_req = time.time()
print(list(conn.execute("SELECT * FROM foo JOIN element ON foo.element_id=element.id WHERE element_id=4137")))
# [(4137, 0, 4137, 88, 89, -623841208484394025),
# (4137, 1, 4137, 88, 89, -623841208484394025),
# (4137, 2, 4137, 88, 89, -623841208484394025),
# (4137, 3, 4137, 88, 89, -623841208484394025),
# (4137, 4, 4137, 88, 89, -623841208484394025)]
print("{:.5f} ms".format((time.time() - time_before_req) * 1000)) # 0.09036 ms
Doing a JOIN between the two large tables took a fraction of a millisecond (thanks to the indexes).
It is a very efficient approach, both space- and speed-wise, and at the very least a SQL database makes working with the data very simple.
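If you prefer to manipulate plain Python dicts on top of that database, a small helper can rebuild one element from the two tables; this is a sketch of mine (the function name and dict layout are assumptions), not part of the benchmark above:
def load_element(conn, element_id):
    # rebuild a single element dict from the element and foo tables
    row = conn.execute(
        "SELECT id, x, y, bar FROM element WHERE id = ?", (element_id,)
    ).fetchone()
    if row is None:
        return None
    id_, x, y, bar = row
    foo = [value for (value,) in conn.execute(
        "SELECT value FROM foo WHERE element_id = ? ORDER BY value", (element_id,)
    )]
    return {"id": id_, "x": x, "y": y, "bar": bar, "foo": foo}
print(load_element(conn, 4137))  # {'id': 4137, 'x': 88, 'y': 89, 'bar': ..., 'foo': [0, 1, 2, 3, 4]}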
To answer your questions precisely:
You can go with a disk-dumped dict, but the space cost will be prohibitive!
import json
element0 = {
    "foo": list(range(500)),
    "bar": -3649913608566960221,
    "x": 88,
    "y": 89,
    "id": 0,
}
with open("element0.json", "wt") as file:
    json.dump(element0, file)
produced a ~2.5 KB file (it would be larger if foo's ints had more digits), which multiplied by 2.5×10^8 elements is roughly 625 GB of files. And that is not counting the filesystem overhead of storing that many small files. So if you are willing to pay for the storage, it is possible. But you had better use some directory hierarchy (for example prefix directories, the way git stores its objects), otherwise your OS's filesystem may choke on a directory that large.
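Here is a minimal sketch of such a prefix-based layout; the two-character prefix, the helper names and the paths are my own assumptions:
import json
import os
def element_path(root, element_id):
    # shard files into subdirectories keyed by the first two hex digits of the id,
    # similarly to how git shards .git/objects
    # (with 2.5×10^8 files you would probably want more than one prefix level)
    key = format(element_id, "08x")
    return os.path.join(root, key[:2], key[2:] + ".json")
def dump_element(root, element):
    path = element_path(root, element["id"])
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wt") as file:
        json.dump(element, file)
dump_element("elements", {"foo": list(range(500)), "bar": -364991360, "x": 88, "y": 89, "id": 0})
# -> elements/00/000000.json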
There are many formats faster to parse than JSON, so many that they even have a Wikipedia page! For example, using C structs in Python:
import struct
serial_format_base = (
    "<"  # little-endian
    "i"  # int32 id
    "i"  # int32 x
    "i"  # int32 y
    "i"  # int32 bar
    "i"  # int32 len(foo)
)
def to_pack(element):
    return element["id"], element["x"], element["y"], element["bar"], len(element["foo"]), *element["foo"]
element0 = {
    "foo": list(range(500)),
    "bar": -364991360,  # truncated so that it fits in an int32
    "x": 88,
    "y": 89,
    "id": 0,
}
serial_format_for_element0 = serial_format_base + "i" * len(element0["foo"])
with open("element0.bin", "wb") as file:
    file.write(struct.pack(serial_format_for_element0, *to_pack(element0)))
with open("element0.bin", "rb") as file:
    base_bytes = file.read(struct.calcsize(serial_format_base))
    id, x, y, bar, len_foo = struct.unpack_from(serial_format_base, base_bytes)
    foo_format = "<" + "i" * len_foo  # keep the same little-endian layout as when packing
    foo_bytes = file.read(struct.calcsize(foo_format))
    foo = struct.unpack_from(foo_format, foo_bytes)
element0_from_disk = {
    "foo": list(foo),
    "bar": bar,
    "x": x,
    "y": y,
    "id": id,
}
assert element0 == element0_from_disk
As for using a database, see my proof of concept with SQLite above.
I chose to store native values, but you could also dump your JSON directly, as in that other question's answers. Most major databases nowadays can do it too, although operating on the JSON afterwards may be costly; I'll let you benchmark that according to your needs.
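For example, a minimal sketch of the JSON-in-SQLite variant; the table and column names are just illustrative assumptions:
import json
import sqlite3
conn = sqlite3.connect("elements_json.db")
conn.execute("CREATE TABLE element_json (id INTEGER PRIMARY KEY, doc TEXT)")
element0 = {"foo": list(range(500)), "bar": -364991360, "x": 88, "y": 89, "id": 0}
conn.execute("INSERT INTO element_json (id, doc) VALUES (?, ?)",
             (element0["id"], json.dumps(element0)))
conn.commit()
(doc,) = conn.execute("SELECT doc FROM element_json WHERE id = ?", (0,)).fetchone()
assert json.loads(doc) == element0
# if your SQLite build includes the JSON1 extension, json_extract() can also query inside doc
conn.close()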
Your data is completely normalized, so I encourage you to stick to a relational database (i.e. NOT a NoSQL one).