6

I have a problem. I have a huge dict. I want to save and load this huge dict, but unfortunately I get a MemoryError. The dict should not be too big: what is read out of the database is around 4GB. I would now like to save this dict and read it back in. However, it should be efficient (not consume much more memory) and not take too long.

What options are there at the moment? I can't get any further with pickle; I get a memory error. I have 200GB of free disk space left.

I looked at Fastest way to save and load a large dictionary in Python and some other questions and blogs.

import os
import pickle
from pathlib import Path

def save_file_as_pickle(file, filename, path=os.path.join(os.getcwd(), 'dict')):
    Path(path).mkdir(parents=True, exist_ok=True)
    pickle.dump( file, open( os.path.join(path, str(filename+'.pickle')), "wb" ))

save_file_as_pickle(dict, "dict")

[OUT]

---------------------------------------------------------------------------
MemoryError                               Traceback (most recent call last)
<timed eval> in <module>

~\AppData\Local\Temp/ipykernel_1532/54965140.py in save_file_as_pickle(file, filename, path)
      1 def save_file_as_pickle(file, filename, path=os.path.join(os.getcwd(), 'dict')):
      2     Path(path).mkdir(parents=True, exist_ok=True)
----> 3     pickle.dump( file, open( os.path.join(path, str(filename+'.pickle')), "wb" ))

MemoryError: 

What worked, but took 1 hour and used 26GB of disk space:

import json

with open('data.json', 'w', encoding='utf-8') as f:
    json.dump(dict, f, ensure_ascii=False, indent=4)

I looked up how big my dict is in bytes. I came across this question How to know bytes size of python object like arrays and dictionaries? - The simple way, and it shows that the dict is only 8448728 bytes.

import sys
sys.getsizeof(dict)
[OUT] 8448728

What my data looks like (example)

{
    '_key': '1',
    'group': 'test',
    'data': {},
    'type': '',
    'code': '007',
    'conType': '1',
    'flag': None,
    'createdAt': '2021',
    'currency': 'EUR',
    'detail': {
        'selector': {
            'number': '12312',
            'isTrue': True,
            'requirements': [{
                'type': 'customer',
                'requirement': '1'}]
            }
        },
    'identCode': [],
}
Test
    just out of curiosity, have you tried JSON? – juanpa.arrivillaga Apr 14 '22 at 08:19
  • @juanpa.arrivillaga no. I'll try it right away. – Test Apr 14 '22 at 08:20
  • So, you have 200 gb of RAM available? – juanpa.arrivillaga Apr 14 '22 at 08:21
  • What about NetCDF? – Giovanni Tardini Apr 14 '22 at 08:21
  • 1
    Also, just so you know, `sys.getsizeof(dict)` will only give you the size of the dict itself, not the objects it contains, so this is not a realistic figure for how much memory it's actually using. – juanpa.arrivillaga Apr 14 '22 at 08:23
  • @GiovanniTardini I haven't heard of it yet. I'll try it right away. And @juanpa.arrivillaga, thanks for the hint! – Test Apr 14 '22 at 08:25
  • @juanpa.arrivillaga saving the file as `JSON` worked, but with 26GB. – Test Apr 14 '22 at 08:46
  • @Test 26 GB of what? disk space, or RAM? – juanpa.arrivillaga Apr 14 '22 at 08:51
  • @juanpa.arrivillaga disk space – Test Apr 14 '22 at 08:53
  • @GiovanniTardini do you know, how I could save this `dict` as `NetCDF`? – Test Apr 19 '22 at 07:53
  • 4
    Why does it all need to be in memory at the same time? Use a database, eg. [sqlite3](https://docs.python.org/3/library/sqlite3.html). – Peter Wood Apr 19 '22 at 08:09
  • 3
    This question is hard to answer as we don't know much about the data (size, structure, how is it used?). You could try things like https://pypi.org/project/ujson or databases (SQL or NoSQL). – Feodoran Apr 19 '22 at 08:11
  • 3
    also - if you have a working setup with your data in a proper database, and even have working code to bring it into an in-memory structure: why bother saving this resulting dictionary? Just re-read it from the database. – jsbueno Apr 19 '22 at 12:35
  • 3
    Two more pieces of missing information: (1) is what you have there really a dictionary, or is it a list of dictionaries, where each entry will be like the dict you pasted? If so, is each entry roughly the same size, or is it composed of a few fields with meta-information, and one "data" field with a list featuring millions of entries like the one above? (2) how much memory (RAM) do you have? – jsbueno Apr 19 '22 at 12:50
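
A minimal sketch of the deep-size measurement suggested in the comments (pympler is also mentioned under one of the answers below), assuming the `pympler` package is installed and using `my_dict` as a stand-in for the dict in question:

import sys

from pympler import asizeof

# sys.getsizeof reports only the dict object itself (its hash table),
# not the keys and values it references; asizeof walks the whole object graph.
print(sys.getsizeof(my_dict))    # shallow size in bytes
print(asizeof.asizeof(my_dict))  # deep size in bytes, usually far larger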

5 Answers

4

The memory error occurs when your RAM (not your hard disk filesystem) cannot hold the serialized form of the dict data. Serialization requires storing all kinds of metadata about the keys and values, searching for and de-duplicating referenced objects, and recording any properties and attributes of the data types (especially database types that are not built-in Python types), all of which is done in RAM first, before even writing a single byte into the file. Since JSON produced 26GB just for the data values, I'd have to assume that all the metadata added on top of that would have increased the memory size of the serialized form even further.

Compression doesn't help, since the serialized data must exist in uncompressed form before any compression can happen. It only saves disk space, not RAM.

JSON may have worked because it streams data out as the dict is read, instead of building the whole JSON representation in memory first. Or it could be that the JSON form, without all the extraneous metadata, fits in your RAM just fine.

If you want to optimize and solve without using JSON, here is one solution:

  • Create a custom dict reader from the database that casts common data types to built-in Python types, or to your own lean custom data types, rather than whatever the default database reader provides using its own types.
  • Create a custom serialization/deserialization method for your data type that handles only the data that actually needs to be stored, and even (de)compresses the data on the fly inside the (de)serialization method; a sketch of this idea follows below.
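
A minimal sketch of the second point (not the answer's exact implementation): stream one record at a time to disk instead of pickling the whole dict in a single call, so only one record has to be serialized at any moment. `fetch_records()` is a hypothetical generator yielding (key, value) pairs from the database.

import pickle

def save_streamed(records, path="dict_stream.pickle"):
    # Pickle each (key, value) pair separately, so only one record
    # is held for serialization at a time.
    with open(path, "wb") as fp:
        for key, value in records:
            pickle.dump((key, value), fp)

def load_streamed(path="dict_stream.pickle"):
    # Rebuild the dict record by record until end of file.
    result = {}
    with open(path, "rb") as fp:
        while True:
            try:
                key, value = pickle.load(fp)
            except EOFError:
                break
            result[key] = value
    return result

# e.g. save_streamed(fetch_records()) while reading from the database,
# then huge_dict = load_streamed() later.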

A hardware solution is of course to increase your RAM and, optionally, your hard disk.

Another option is to try this on Linux, which tends to have better memory optimization than Windows.

Yusuf N
3

There are two ways to make the pickling more performant:

  1. disabling the Garbage Collector while pickling for a speedup
  2. using gzip to generate a compressed output file

Give this a try:

import gc
import gzip
import os
import pickle
from pathlib import Path


def save_file_as_pickle(file, filename, path=os.path.join(os.getcwd(), "dict")):
    Path(path).mkdir(parents=True, exist_ok=True)
    file_path = os.path.join(path, str(filename + ".pickle"))

    gc.disable()
    try:
        gc.collect()
        with gzip.open(file_path, "wb") as fp:
            pickle.dump(file, fp)
    finally:
        gc.enable()


save_file_as_pickle(my_dict, "dict")
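
For completeness, a matching loader sketch under the same assumptions (a gzip-compressed pickle at the path used above); `load_file_from_pickle` is a name introduced here for illustration:

import gzip
import os
import pickle


def load_file_from_pickle(filename, path=os.path.join(os.getcwd(), "dict")):
    # Read back a dict written by save_file_as_pickle above.
    file_path = os.path.join(path, str(filename + ".pickle"))
    with gzip.open(file_path, "rb") as fp:
        return pickle.load(fp)


my_dict = load_file_from_pickle("dict")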
Corbie
  • 3
    Any such pattern should put the `gc.enable` call inside a `finally` block. – jsbueno Apr 19 '22 at 12:33
  • 1
    @jsbueno I have added the try/finally block and a `gc.collect()` call to my answer. – Corbie Apr 19 '22 at 13:00
  • 1
    Thank you for your answer. I got the following error: `MemoryError`. – Test Apr 20 '22 at 07:56
  • 1
    Would you please post the value that `objsize.get_deep_size` or `pympler.asizeof.asizeof` returns? – Corbie Apr 20 '22 at 09:18
  • 1
  • @Corbie sure, but where should I implement `objsize.get_deep_size` or `pympler.asizeof.asizeof`? – Test Apr 21 '22 at 08:30
  • @Test just import the packages and call the functions with your dictionary. See the [link](https://stackoverflow.com/questions/13530762/how-to-know-bytes-size-of-python-object-like-arrays-and-dictionaries-the-simp) you have posted in your own question above. – Corbie Apr 22 '22 at 09:04
1

I would consider trying out some other formats, although I am not 100% sure that they are better.

See these Stack Overflow answers.

For HDF5, I would want to try out the dict-to-HDF5 library hdfdict to see if it works.

import hdfdict
import numpy as np


d = {
    'a': np.random.randn(10),
    'b': [1, 2, 3],
    'c': 'Hallo',
    'd': np.array(['a', 'b']).astype('S'),
    'e': True,
    'f': (True, False),
}
fname = 'test_hdfdict.h5'
hdfdict.dump(d, fname)
res = hdfdict.load(fname)

print(res)
Steve
1

If nothing else works, you might consider splitting the dataset and saving it in chunks. You can use threading, or you can rewrite the code below to do it serially. I assumed that your dictionary is a list of dictionaries; if it is a dictionary of dictionaries, you need to adjust the code accordingly. Also note that this example needs further adjustment: depending on how you choose the step size, the last entries might not be saved or loaded.

import pickle
import threading

# create a huge list of dicts
size = 1000000
mydict_list = [{'_key': f'{i}', 'group': 'test'} for i in range(size)]

# try to save it as one full file just to see how large it is
#with open('whole_list.pkl', 'wb') as f:
#    pickle.dump(mydict_list, f)


# define function to save the smaller parts
def savedata(istart, iend):
    tmp = mydict_list[istart:iend]
    with open(f'items_{istart}_{iend}.pkl', 'wb') as f:
        pickle.dump(tmp, f)

# define function to load the smaller parts
def loaddata(istart, iend):
    with open(f'items_{istart}_{iend}.pkl', 'rb') as f:
        results[f'{istart}_{iend}'] = pickle.load(f)

# define into how many chunks you want to split the file
steps = int(size / 10)

# split the list and save it using threading
results = {}
threads = {}
for i in range(0, len(mydict_list), steps):
    threads[i] = None

for i in range(0, len(mydict_list), steps):
    print(f'processing: {i, i + steps}')
    threads[i] = threading.Thread(target=savedata, args=(i, i + steps))
    threads[i].start()

for i in range(0, len(mydict_list), steps):
    threads[i].join()


# load the list using threading
threads = {}
for i in range(0, len(mydict_list), steps):
    threads[i] = None

for i in range(0, len(mydict_list), steps):
    print(f'processing: {i, i + steps}')
    threads[i] = threading.Thread(target=loaddata, args=(i, i + steps))
    threads[i].start()

for i in range(0, len(mydict_list), steps):
    threads[i].join()
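
A possible follow-up to this sketch: once all loading threads have joined, the chunks collected in results can be merged back into a single list in chunk order.

# merge the loaded chunks back into one list, in the original order
merged = []
for i in range(0, len(mydict_list), steps):
    merged.extend(results[f'{i}_{i+steps}'])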
horseshoe
1

TL;DR

The main issue here is the lack of a streaming-friendly data format. I recommend reading and writing the jsonl format, while keeping your regular dict for the actual work. Try these two options:

  1. gzip + jsonl, using the file api (faster write)
  2. plain (uncompressed) jsonl, using the mmap api (faster read)

Full details below:


JSON Lines Format

The idea is to provide a format as close as possible to JSON, while staying splittable.

This allows for line-by-line, or block-by-block (blocks of lines), multiprocessing, locally or distributed (a common big-data practice might be storing on HDFS and processing with Spark, for example).

It goes nicely with gzip compression, which is split-friendly by itself, allowing for sequential reads and writes.

We'll wrap the read and write so that the application is agnostic to it and can still deal with the regular dict.

A data simulator

I created 1M dict entries from your sample, with varying keys, currency and year (to challenge the gzip compression a bit). I used a MacBook Pro M1.

import json
import gzip
import mmap
import subprocess

d = {}
years = { 0: 2019, 1: 2020, 2:2121 }
currencies = { 0: 'EUR', 1: 'USD', 2: 'GBP' }
n = int(1e6)

for i in range(n):
    rem = i % 3
    d[i] = {
        '_key': str(i),
        'group': 'test',
        'data': {},
        'type': '',
        'code': '007',
        'conType': '1',
        'flag': None,
        'createdAt': years[rem],
        'currency': currencies[rem],
        'detail': {
            'selector': {
                'number': '12312',
                'isTrue': True,
                'requirements': [{
                    'type': 'customer',
                    'requirement': '1'}]
                }
            },
        'identCode': [],
    }

Option #1 - gzip file api

For the 1M dataset it took ~10s to write and ~6s to read again.

file_name_jsonl_gz = './huge_dict.jsonl.gz'

# write
with gzip.open(file_name_jsonl_gz, 'wt') as f:
    for k, v in d.items():
        f.write(f'{{"{k}":{json.dumps(v)}}}\n') # from k, v pair into a json line

# read again
_d = {}
with gzip.open(file_name_jsonl_gz, 'rt') as f:
    for line in f:
        __d = json.loads(line)
        k, v = tuple(__d.items())[0] # from a single json line into k, v pair
        _d[k] = v

# test integrity
json.dumps(d) == json.dumps(_d)

True

Option #2 - mmap api

For the 1M dataset it took ~5s to write and ~8s to read again.

The memory-mapped file is a potentially very strong technique for making our IO more robust. The basic idea is mapping [huge] files into the virtual memory system, allowing partial / fast / concurrent reads and writes. So it is good both for huge files (that can't be fitted into memory) and as a performance boost.

The code is more cumbersome, and not always the fastest, but you can further tweak it for your needs. There are many details about it, so I recommend reading more on the wiki and in the Python api docs, rather than overwhelming the answer here.

file_name_mmap_jsonl = './huge_dict_mmap.jsonl'
# an initial large empty file (hard to estimate in advance)
# change the size for your actual needs
subprocess.Popen(['truncate', '-s', '10G', file_name_mmap_jsonl])

pos_counter = 0
with open(file_name_mmap_jsonl, mode='r+', encoding="utf-8") as f:
    # mmap gets its file descriptor from the file object
    with mmap.mmap(fileno=f.fileno(), length=0, access=mmap.ACCESS_WRITE) as mm:
        buffer = []
        for k, v in d.items():
            s = f'{{"{k}":{json.dumps(v)}}}\n' # from k, v pair into a json line
            b = s.encode()
            buffer.append(b)
            pos_counter += len(b)

            # using buffer; not to abuse the write for every line
            # try and tweak it further
            if len(buffer) >= 100:
                mm.write(b''.join(buffer))
                buffer = []
        
        mm.write(b''.join(buffer))
        mm.flush()

# shrink to the exact needed size
subprocess.Popen(['truncate', '-s', str(pos_counter), file_name_mmap_jsonl])
# read again
_d = {}
with open(file_name_mmap_jsonl, mode='r+', encoding="utf-8") as f:
    with mmap.mmap(fileno=f.fileno(), length=0, access=mmap.ACCESS_READ) as mm:
        while True:
            line = mm.readline()
            if len(line) == 0: # EOF
                break
            __d = json.loads(line)
            k, v = tuple(__d.items())[0] # from a json line into k, v pair
            _d[k] = v

# test integrity
json.dumps(d) == json.dumps(_d)

True

There was also a 3rd option: mmap + gzip, but the write was slow and there were issues with decompressing the lines back. I recommend pursuing this, though - it would allow for a much smaller file size on disk.

mork