
I have a 1.5 GB dictionary that takes about 90 seconds to calculate, so I want to save it to storage once and load it every time I want to use it again. This creates two challenges:

  1. Loading the file has to take less than 90 seconds.
  2. As RAM is limited to ~4 GB (in PyCharm), it cannot be memory-intensive.

I also need it to be UTF-8 capable.

I have tried solutions such as pickle, but they always end up throwing a MemoryError. Note that my dictionary is made of strings, so solutions like the one in this post do not apply.

Things I do not care about:

  1. Saving time (as long as it's not more than ~20 minutes, as I'm looking to do it once).
  2. How much space it takes in storage to save the dictionary.

How can I do that? Thanks.

Edit:

I forgot to mention that it's a dictionary containing sets, so json.dump() doesn't work, as it can't handle sets.

  • Have you considered [sqlite3](https://docs.python.org/3/library/sqlite3.html)? – Balaji Ambresh Jun 08 '20 at 07:11
  • `json.dump( dictionary, somefile )`? – lenik Jun 08 '20 at 07:14
  • This is the prototypical use case of a database. – Tomalak Jun 08 '20 at 07:33
  • I have considered [SQLite3](https://docs.python.org/2/library/sqlite3.html), but never managed to make it work. It is also said [here](https://stackoverflow.com/questions/10913080/python-how-to-insert-a-dictionary-to-a-sqlite-database) that it "cannot be done easily". – Ilay Gussarsky Jun 08 '20 at 09:14
  • I've read the other question, but *what exactly* cannot be done remains unclear. Insert the values of a dict into sqlite? Of course this can be done. But it would require deeper understanding of how you create the data (how uniform is it, how stable is it) and how you use the data in your program to find out what the best database strategy would be. – Tomalak Jun 08 '20 at 16:23
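Since json.dump() was suggested in the comments above but the dictionary contains sets, here is a minimal sketch of a set-aware JSON dumper/loader; the `SetEncoder` class, the `"__set__"` tag and the file name are illustrative choices, not part of the question or any answer:

import json

# Encode sets as tagged lists so they survive a round trip through JSON.
class SetEncoder(json.JSONEncoder):
    def default(self, obj):
        if isinstance(obj, set):
            return {"__set__": sorted(obj)}
        return super().default(obj)

def decode_sets(obj):
    # object_hook: turn the tagged lists back into sets while loading
    if "__set__" in obj:
        return set(obj["__set__"])
    return obj

d = {"key": {"value_1", "value_2"}}

with open("dump.json", "w", encoding="utf-8") as fout:
    json.dump(d, fout, cls=SetEncoder, ensure_ascii=False)  # keeps UTF-8 readable

with open("dump.json", encoding="utf-8") as fin:
    restored = json.load(fin, object_hook=decode_sets)

assert restored == d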

2 Answers


If the dict consumes a lot of memory because it has many items, you could try dumping many smaller dicts and combining them with update:

mk_pickle.py

import pickle

CHUNKSIZE = 10  # in practice you would make this much bigger

def mk_chunks(d, chunk_size):
    chunk = {}
    ctr = chunk_size
    for key, val in d.items():
        chunk[key] = val
        ctr -= 1
        if ctr == 0:
            yield chunk
            ctr = chunk_size
            chunk = {}
    if chunk:
        yield chunk

def dump_big_dict(d):
    with open("dump.pkl", "wb") as fout:
        for chunk in mk_chunks(d, CHUNKSIZE):
            pickle.dump(chunk, fout)


# For testing:
N = 1000

big_dict = dict()
for n in range(N):
    big_dict[n] = "entry_" + str(n)

dump_big_dict(big_dict)

read_dict.py

import pickle

d = {}
with open("dump.pkl", "rb") as fin:
    while True:
        try:
            small_dict = pickle.load(fin)
        except EOFError:
            break
        d.update(small_dict)
gelonida
  • The function `mk_chunks` can be written more elegantly. Look, for example, at https://stackoverflow.com/questions/3992735/python-generator-that-groups-another-iterable-into-groups-of-n and use the answer from unutbu – gelonida Jun 08 '20 at 10:03
  • This worked perfectly (and, by the way, took 7.5 seconds). Thank you! – Ilay Gussarsky Jun 08 '20 at 10:41
  • @IlayGussarsky ...and you could reduce that to very close to zero if you would use a database. Pickle is not a good fit for this type of task. – Tomalak Jun 08 '20 at 15:58
  • You might be able to use a database and json if you customize your json dumper / loader, but to see how easy it would be, some sample records would be required. If they all have the same structure you could dump and load sets. Normally dumping and loading json is faster than pickling / unpickling. Not sure how much you would really win with a database though. I'm a little less optimistic than @Tomalak, but using a custom json serializer / deserializer should accelerate things. Even a database needs a little time to read 1.5 G from a DB and to create dicts and sets from it – gelonida Jun 08 '20 at 16:02
  • I would not involve any JSON at all. Plain, basic tables with a proper index would be my tool of choice. I would not load all of it into memory either (that would defeat the purpose of having an SQL database), but write specific queries to select or update data. – Tomalak Jun 08 '20 at 16:04
  • Ah now I understand what you mean. You mean to change the entire code to not use a dict, but to use a DB. Whether this will be efficient depends on the code and how often it accesses elements from the dict. – gelonida Jun 08 '20 at 16:07
  • In fact a solution could be using modules like `sqlitedict` ( https://pypi.org/project/sqlitedict/ ). The code almost doesn't change, as sqlitedict creates an object that behaves like a dict but is stored in an (sqlite) database. This, in combination with some memoization (caching of accesses to the dict), might be interesting. It all depends on how many of the dict keys you are accessing during a typical run of your application. – gelonida Jun 08 '20 at 16:13
  • That's right, but if the data is expensive to calculate, the net overhead for storing it in a database as you go is negligible. An SQL SELECT will then outperform any "load all of it into memory again" by orders of magnitude, and - since the OP said they are memory-conscious - would be very conservative on RAM. But of course this only works if they don't actually *need* "all of it in memory" - and that's the part I'm assuming. – Tomalak Jun 08 '20 at 16:13

You could try to generate and save it in parts, in several files: generate some key-value pairs, store them in a file with pickle, delete them from memory, then continue until all key-value pairs are exhausted (a sketch of this writing side appears at the end of this answer).

Then, to load the whole dict, you could use dict.update for each part, but that could also run into memory trouble. Instead, you can make a class derived from dict which reads the corresponding file on demand according to the key (I mean overriding __getitem__), something like this:

import pickle

class Dict(dict):
    def __init__(self):
        super().__init__()
        self.dict = {}  # the part currently held in memory

    def __getitem__(self, key):
        if key in self.dict:
            return self.dict[key]
        else:
            del self.dict  # destroy the old part before the new one is created
            with open(self.getFileName(key), "rb") as fin:
                self.dict = pickle.load(fin)
            return self.dict[key]

    filenames = ['key1', 'key1000', 'key2000']

    def getFileName(self, key):
        '''assuming the keys are separated in files by alphabetical order,
        each file name taken from its first key'''
        if key in self.filenames:
            return key
        else:
            A = sorted(self.filenames + [key])
            return A[A.index(key) - 1]

Keep in mind that smaller dicts will load faster, so you should experiment to find the right number of files.

You can also keep more than one part in memory, depending on the available memory.
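For completeness, a rough sketch of the writing side described above, assuming the full dict fits in memory once while it is being split (in practice you could also write each part as you generate it) and that the keys are short strings that make valid file names; the name dump_in_parts and the part size are illustrative:

import pickle

PART_SIZE = 1000  # keys per file; tune to fit memory

def dump_in_parts(d, part_size=PART_SIZE):
    filenames = []
    keys = sorted(d)  # alphabetical order, matching getFileName above
    for start in range(0, len(keys), part_size):
        part_keys = keys[start:start + part_size]
        part = {k: d[k] for k in part_keys}
        filename = part_keys[0]  # each file is named after its first key
        with open(filename, "wb") as fout:
            pickle.dump(part, fout)
        filenames.append(filename)
    return filenames  # this is the list the Dict class above expects in filenames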

nadapez
  • No need to create multiple pickle files. You can write one object after the other into the same pickle file; every `pickle.load` will return one pickled item only, and the next read will load the next one – gelonida Jun 08 '20 at 09:40
  • But the dicts have to be accessed randomly, not in sequence – nadapez Jun 08 '20 at 16:20
  • Ah, I see! So in the end you create a custom on-disk persistent dict. In that case I would just use something like sqlitedict. – gelonida Jun 08 '20 at 16:30
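For reference, a minimal sketch of the sqlitedict approach suggested in the comments (requires `pip install sqlitedict`; the file name and the example key/value are illustrative). Each value is pickled per entry, so only the entries you actually touch are held in memory:

from sqlitedict import SqliteDict

# Build (or update) the on-disk dict once; values (sets of strings) are
# pickled per entry by sqlitedict.
with SqliteDict("big_dict.sqlite", autocommit=True) as db:
    db["some_key"] = {"value_1", "value_2"}

# Later runs: open the same file and read entries on demand,
# with no 90-second rebuild and no full load into RAM.
with SqliteDict("big_dict.sqlite") as db:
    print(db["some_key"])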