When you say "space on disk is not an issue", how many TB do you actually have? You have 2.5×10^8 elements, each having ~5×10^2 ints in foo, each int taking (up to) 4 bytes, which gives a bare minimum of 5×10^11 bytes = 500 GB for the data alone.
I think if you have the storage space for it, SQLite would be a very good solution.
I created a database:
import sqlite3
create_table_element_stmt = """
CREATE TABLE element (
    id INTEGER PRIMARY KEY,
    x INTEGER,
    y INTEGER,
    bar INTEGER
)
"""
create_index_on_element_id_stmt = """
CREATE UNIQUE INDEX index_id ON element(id)
"""
create_table_foo_stmt = """
CREATE TABLE foo (
    element_id INTEGER,
    value INTEGER,
    PRIMARY KEY(element_id, value),
    FOREIGN KEY(element_id) REFERENCES element(id)
)
"""
create_index_on_foo_element_id_stmt = """
CREATE UNIQUE INDEX index_element_id ON foo(element_id, value)
"""
conn = sqlite3.connect("so70947211.db")
conn.execute(create_table_element_stmt)
conn.execute(create_table_foo_stmt)
conn.commit()
and populated it with dummy values:
import itertools
import time
from math import ceil, floor
import tqdm  # installed with pip
N_ELEMENTS = 250 * 10**6
# N_FOOS_BY_ELEMENT = 500
N_FOOS_BY_ELEMENT = 5
def yield_elements_no_foo_as_tuple():
    x, y, bar = ord('X'), ord('Y'), hash("bar")  # 88, 89, -3649913608566960221
    yield from ((id, x, y, bar)
                for id in range(N_ELEMENTS))
def yield_foos_as_tuple():
    # foo lengths vary around N_FOOS_BY_ELEMENT (from 80% to 125% of it)
    foos = [list(range(length + 1))
            for length in range(int(floor(N_FOOS_BY_ELEMENT * 0.8)),
                                int(ceil(N_FOOS_BY_ELEMENT * 1.25)))]
    yield from itertools.chain.from_iterable(
        zip(itertools.repeat(id), foo)
        for id, foo in zip(range(N_ELEMENTS), itertools.cycle(foos))
    )
insert_element_stmt = """
INSERT INTO element (id, x, y, bar) VALUES (?, ?, ?, ?)
"""
insert_foo_stmt = """
INSERT INTO foo (element_id, value) VALUES(?, ?)
"""
conn.executemany(insert_element_stmt, tqdm.tqdm(yield_elements_no_foo_as_tuple(), total=N_ELEMENTS))
conn.commit() # `executemany` does not commit on its own
time_before_index1 = time.time()
conn.execute(create_index_on_element_id_stmt) # create the index AFTER inserting the data
conn.commit()
print(time.time() - time_before_index1)
conn.executemany(insert_foo_stmt, tqdm.tqdm(yield_foos_as_tuple(), total=N_ELEMENTS*N_FOOS_BY_ELEMENT))
conn.commit() # `executemany` does not commit on its own
time_before_index2 = time.time()
conn.execute(create_index_on_foo_element_id_stmt) # create the index AFTER inserting the data
conn.commit()
print(time.time() - time_before_index2)
conn.close()
which took:
step | duration | resulting DB file size
--- | --- | ---
inserting elements | ~4 minutes | ~5.5 GB
indexing elements | ~1 minute | ~9 GB
inserting 1% of foos | ~33 minutes | ?
indexing foos | ~9 minutes | ~83 GB
For example purposes, I decided to insert constant values because it made the process much faster than calling random.randint millions of times. I still think it is representative of the actual result you would get, because SQLite does not compress data by default.
For your actual data, it may be hard to keep Python from being the bottleneck (it was for me in the very first version I wrote). Also, I inserted only 5 foos per element instead of ~500, which would have made the process roughly 100 times slower (around 48 hours). There are several questions on this site about bulk-insert performance; I recommend this one. Also, consider using another language to create the database; it will help greatly.
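As a rough illustration of the kind of tuning those bulk-insert questions discuss, here is a sketch; the specific PRAGMA values are my assumptions, they trade crash-safety for speed during the initial load, so benchmark them for your case:
import sqlite3
conn = sqlite3.connect("so70947211.db")
# relax durability guarantees during the initial bulk load only (assumption: the DB can be rebuilt if it crashes)
conn.execute("PRAGMA journal_mode = OFF")
conn.execute("PRAGMA synchronous = OFF")
conn.execute("PRAGMA cache_size = -1000000")  # negative value = size in KiB, so ~1 GB of page cache
with conn:  # group all the inserts into a single transaction
    conn.executemany(insert_element_stmt, yield_elements_no_foo_as_tuple())
conn.close()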
Here is the result:
print(list(conn.execute("SELECT COUNT(*) FROM element"))) # [(250000000,)]
print(list(conn.execute("SELECT * FROM element LIMIT 1"))) # [(0, 88, 89, -3649913608566960221)]
print(list(conn.execute("SELECT COUNT(*) FROM foo"))) # [(1250000000,)]
print(list(conn.execute("SELECT * FROM foo LIMIT 100"))) # [(0, 0), (1, 1), (2, 2), (3, 3), ...]
print(list(conn.execute("EXPLAIN QUERY PLAN SELECT * FROM foo JOIN element ON foo.element_id=element.id WHERE element_id=4137")))
# [(3, 0, 0, 'SEARCH TABLE element USING INTEGER PRIMARY KEY (rowid=?)'),
# (10, 0, 0, 'SEARCH TABLE foo USING COVERING INDEX index_element_id (element_id=?)')]
time_before_req = time.time()
print(list(conn.execute("SELECT * FROM foo JOIN element ON foo.element_id=element.id WHERE element_id=4137")))
# [(4137, 0, 4137, 88, 89, -623841208484394025),
# (4137, 1, 4137, 88, 89, -623841208484394025),
# (4137, 2, 4137, 88, 89, -623841208484394025),
# (4137, 3, 4137, 88, 89, -623841208484394025),
# (4137, 4, 4137, 88, 89, -623841208484394025)]
print("{:.5f} ms".format((time.time() - time_before_req) * 1000)) # 0.09036 ms
Doing a JOIN between the two large tables took a fraction of a millisecond (thanks to the indexes).
It is a very efficient approach, both space- and speed-wise, and at the very least a SQL database makes working with the data very simple.
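If you prefer to manipulate plain Python dicts on top of that database, a small helper can rebuild one element from the two tables; this is a sketch of mine (the function name and dict layout are assumptions), not part of the benchmark above:
def load_element(conn, element_id):
    # rebuild a single element dict from the element and foo tables
    row = conn.execute(
        "SELECT id, x, y, bar FROM element WHERE id = ?", (element_id,)
    ).fetchone()
    if row is None:
        return None
    id_, x, y, bar = row
    foo = [value for (value,) in conn.execute(
        "SELECT value FROM foo WHERE element_id = ? ORDER BY value", (element_id,)
    )]
    return {"id": id_, "x": x, "y": y, "bar": bar, "foo": foo}
print(load_element(conn, 4137))  # {'id': 4137, 'x': 88, 'y': 89, 'bar': ..., 'foo': [0, 1, 2, 3, 4]}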
To answer your questions precisely:
You can go with a disk-dumped dict, but the space cost will be prohibitive!
import json
element0 = {
    "foo": list(range(500)),
    "bar": -3649913608566960221,
    "x": 88,
    "y": 89,
    "id": 0,
}
with open("element0.json", "wt") as file:
    json.dump(element0, file)
produced a ~2.5 KB file (it would be larger if foo's ints had more digits), which multiplied by 2.5×10^8 elements is roughly 625 GB of files. And that is not counting the filesystem overhead of storing that many small files. So if you are willing to pay for the storage, it is possible. But you had better use some directory hierarchy (for example prefix directories, the way git stores its objects), otherwise your OS's filesystem may choke on a directory that large.
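Here is a minimal sketch of such a prefix-based layout; the two-character prefix, the helper names and the paths are my own assumptions:
import json
import os
def element_path(root, element_id):
    # shard files into subdirectories keyed by the first two hex digits of the id,
    # similarly to how git shards .git/objects
    # (with 2.5×10^8 files you would probably want more than one prefix level)
    key = format(element_id, "08x")
    return os.path.join(root, key[:2], key[2:] + ".json")
def dump_element(root, element):
    path = element_path(root, element["id"])
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wt") as file:
        json.dump(element, file)
dump_element("elements", {"foo": list(range(500)), "bar": -364991360, "x": 88, "y": 89, "id": 0})
# -> elements/00/000000.json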
There are many formats faster to parse than JSON, so many that they even have a Wikipedia page! For example, using C structs in Python:
import struct
serial_format_base = (
    "<"  # little-endian
    "i"  # int32 id
    "i"  # int32 x
    "i"  # int32 y
    "i"  # int32 bar
    "i"  # int32 len(foo)
)
def to_pack(element):
    return element["id"], element["x"], element["y"], element["bar"], len(element["foo"]), *element["foo"]
element0 = {
    "foo": list(range(500)),
    "bar": -364991360,  # truncated so that it fits in an int32
    "x": 88,
    "y": 89,
    "id": 0,
}
serial_format_for_element0 = serial_format_base + "i" * len(element0["foo"])
with open("element0.bin", "wb") as file:
    file.write(struct.pack(serial_format_for_element0, *to_pack(element0)))
with open("element0.bin", "rb") as file:
    base_bytes = file.read(struct.calcsize(serial_format_base))
    id, x, y, bar, len_foo = struct.unpack_from(serial_format_base, base_bytes)
    foo_format = "<" + "i" * len_foo  # keep the same little-endian layout as when packing
    foo_bytes = file.read(struct.calcsize(foo_format))
    foo = struct.unpack_from(foo_format, foo_bytes)
element0_from_disk = {
    "foo": list(foo),
    "bar": bar,
    "x": x,
    "y": y,
    "id": id,
}
assert element0 == element0_from_disk
As for using a database, see my proof of concept with SQLite above.
I chose to store native values, but you could also dump your JSON directly, as in that other question's answers. Most major databases nowadays can do it too, although operating on the JSON afterwards may be costly; I'll let you benchmark that according to your needs.
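For example, a minimal sketch of the JSON-in-SQLite variant; the table and column names are just illustrative assumptions:
import json
import sqlite3
conn = sqlite3.connect("elements_json.db")
conn.execute("CREATE TABLE element_json (id INTEGER PRIMARY KEY, doc TEXT)")
element0 = {"foo": list(range(500)), "bar": -364991360, "x": 88, "y": 89, "id": 0}
conn.execute("INSERT INTO element_json (id, doc) VALUES (?, ?)",
             (element0["id"], json.dumps(element0)))
conn.commit()
(doc,) = conn.execute("SELECT doc FROM element_json WHERE id = ?", (0,)).fetchone()
assert json.loads(doc) == element0
# if your SQLite build includes the JSON1 extension, json_extract() can also query inside doc
conn.close()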
Your data is completely normalized, so I encourage you to stick to a relational database (i.e. NOT a NoSQL one).