I am running a numerical experiment that requires many iterations. After each iteration, I would like to store the data in a pickle file (or something pickle-like) in case the program times out or a data structure gets tapped out. What is the best way to proceed? Here is the skeleton code:

import pickle

data_dict = {}                       # maybe a dictionary is not the best choice
for j in parameters:                 # j = (alpha, beta, gamma); cycle through parameter sets
    for k in range(number_of_experiments):  # lots of experiments (10^4)
        file = open('storage.pkl', 'ab')
        data = experiment()          # experiment returns some numerical value;
                                     # it takes ~1 second, but that increases
                                     # as the parameters scale
        data_dict.setdefault(j, []).append(data)
        pickle.dump(data_dict, file)
        file.close()

Questions:

  1. Is shelve a better choice here? Or some other Python library that I am not aware of?
  2. I am using a dict because it's easier to code and more flexible if I need to change things as I do more experiments. Would it be a huge advantage to use a pre-allocated array?
  3. Does opening and closing files affect run time? I do this so that I can check on the progress in addition to the text logs I have set up.

Thank you for all your help!

Charlie
  • For your "open" overhead I'm getting about 39 microseconds for open with 'ab' options for a few-byte file, 41 microseconds (us) for 1 kB, 44 microseconds for ~10 kB, 158 us for 100 kB and 2 MB, and 162 us for 20 MB files. So not a lot if your file size is below 20 MB... This is with an SSD so YMMV. – dhj Jun 28 '14 at 23:14
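A rough way to reproduce that kind of open/close measurement with timeit (the ~1 kB dummy file below is just illustrative, not the setup dhj used):

import timeit

# create a small test file (~1 kB) so that open(..., 'ab') has something to append to
with open('storage.pkl', 'wb') as f:
    f.write(b'\x00' * 1024)

# average cost of one open/close cycle in append mode
t = timeit.timeit("open('storage.pkl', 'ab').close()", number=10000)
print('%.1f us per open/close' % (t / 10000 * 1e6))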

2 Answers

  1. Assuming you are using numpy for your numerical experiments, instead of pickle I would suggest using numpy.savez.
  2. Keep it simple and make optimizations only if you feel that the script runs too long.
  3. Opening and closing files does affect the run time, but having a backup is anyway better.

And I would use collections.defaultdict(list) instead of plain dict and setdefault.
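A minimal sketch of how those two suggestions could fit the skeleton from the question (untested; parameters, number_of_experiments and experiment() are the placeholders from the question, and 'results.npz' is an arbitrary file name):

import numpy as np
from collections import defaultdict

results = defaultdict(list)              # j -> list of experiment values

for j in parameters:                     # j = (alpha, beta, gamma)
    for k in range(number_of_experiments):
        results[j].append(experiment())
    # snapshot everything collected so far after each parameter set;
    # np.savez stores one named array per keyword, so keys are stringified
    np.savez('results.npz', **{str(key): np.asarray(vals)
                               for key, vals in results.items()})

# later, or after a crash: np.load('results.npz') returns a dict-like of arrays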

newtover

Shelve is probably not a good choice, however...

You might try using klepto or joblib. Both are good at caching results, and can use efficient storage formats.

Both joblib and klepto can save your results to a file on disk, or to a directory. Both can also leverage the numpy internal storage format and/or compression on save… and also save to memory mapped files, if you like.

If you use klepto, it takes the dictionary key as the filename and saves the value as the file's contents. You can also pick whether you want to use pickle, json, or some other storage format.

Python 2.7.7 (default, Jun  2 2014, 01:33:50) 
[GCC 4.2.1 Compatible Apple Clang 4.1 ((tags/Apple/clang-421.11.66))] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import klepto
>>> data_dict = klepto.archives.dir_archive('storage', cached=False, serialized=True)     
>>> import string
>>> import random
>>> for j in string.ascii_letters:
...   for k in range(1000):
...     data_dict.setdefault(j, []).append([int(10*random.random()) for i in range(3)])
... 
>>> 

This will create a directory called storage that contains pickled files, one for each key of your data_dict. There are keywords for using memmap files, and also for the compression level. If you choose cached=True instead, then rather than dumping to file each time you write to data_dict, you'd write to memory each time… and you could then use data_dict.dump() to dump to disk whenever you choose, or set a memory limit so that you dump to disk when you hit it. Additionally, you can pick a caching strategy (like lru or lfu) for deciding which keys to purge from memory and dump to disk.
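For example, a rough sketch of that memory-cached variant (untested; it uses the same dir_archive interface as above, with the dump() call mentioned above and its load() counterpart):

>>> import klepto
>>> data_dict = klepto.archives.dir_archive('storage', cached=True, serialized=True)
>>> data_dict['a'] = [1, 2, 3]     # writes go to the in-memory cache only
>>> data_dict.dump()               # push the cache out to the 'storage' directory
>>> data_dict.load()               # pull archived entries back into the cache
>>> data_dict['a']
[1, 2, 3]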

Get klepto here: https://github.com/uqfoundation

or get joblib here: https://github.com/joblib/joblib
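And for completeness, a minimal joblib sketch (untested; it reuses the placeholder names from the question, and 'storage.joblib' and compress=3 are arbitrary choices):

import joblib

results = {}
for j in parameters:                          # j = (alpha, beta, gamma)
    results.setdefault(j, [])
    for k in range(number_of_experiments):
        results[j].append(experiment())
        # rewrite a compressed snapshot so a crash loses at most one experiment
        joblib.dump(results, 'storage.joblib', compress=3)

# after a timeout or crash, pick up where you left off:
results = joblib.load('storage.joblib')

Dumping less often (say, once per parameter set) trades a bit of crash safety for speed, which is the same granularity trade-off described below for klepto.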

If you refactor, you could probably come up with a way to do this that takes advantage of a pre-allocated array. However, whether that helps depends on the profile of how your code runs.

Does opening and closing files affect run time? Yes. If you use klepto, you can set the granularity of when you want to dump to disk. Then you can pick a trade-off of speed versus intermediate storage of results.

Mike McKerns
  • Thank you for this mini-tutorial! Really appreciate it. – Charlie Jun 30 '14 at 20:31
  • You are welcome. I probably remember enough of joblib to add an example of that as well, if you find that the above doesn't work for you. – Mike McKerns Jun 30 '14 at 21:41
  • Hey Mike, Know this question was officially closed, but have recently run into this [problem](http://stackoverflow.com/questions/25924397/python-multiprocessing-and-serializing-data) and am pretty stuck. I have emailed it to my knowledgeable friends in hopes someone will answer it, but no luck yet. I am wondering if you have encountered this difficulty with such data and how you solved it. In fact, I am commenting here, in the hope that one of these modules might actually remedy the situation. – Charlie Oct 03 '14 at 01:34
  • I ended up optimizing the actual Python code, so I didn't need to implement data storage using the modules you suggested to speed up run time; hence I stuck with pickle just to get it working the first time! – Charlie Oct 03 '14 at 01:36