Most efficient way of storing a dictionary of dataframes

Question

I have a dictionary that contains dataframes.

dictionary = {"key1": df1,
              "key2": df2}  # and so on...

A few Stack Overflow posts and Reddit threads suggest the json module and the pickle module.

What would be the most efficient way, and why?

When I convert a small dictionary into a pickle file, the file shows a size of 0 KB, and loading it raises EOFError: Ran out of input, which is explained here: Why do I get "Pickle - EOFError: Ran out of input" reading an empty file?

Use pickle. You might not be able to store all possible data of the dataframes in JSON. – luigigi Dec 17 '19 at 06:41
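
To illustrate the comment above: the standard json module cannot serialize a DataFrame directly, while pickle round-trips the whole dictionary with dtypes intact. This is a minimal sketch, assuming pandas is installed; the column names and values are made up for illustration.

```python
import json
import pickle

import pandas as pd

# Hypothetical example data: a datetime and a categorical column,
# two dtypes that plain JSON has no native representation for.
frames = {
    "key1": pd.DataFrame({
        "when": pd.to_datetime(["2019-12-17", "2019-12-18"]),
        "label": pd.Categorical(["a", "b"]),
    })
}

# json.dumps does not know how to encode a DataFrame and raises TypeError.
try:
    json.dumps(frames)
    json_failed = False
except TypeError:
    json_failed = True

# pickle serializes the dictionary as-is and restores equal frames.
restored = pickle.loads(pickle.dumps(frames))
same_values = restored["key1"].equals(frames["key1"])
same_dtypes = list(restored["key1"].dtypes) == list(frames["key1"].dtypes)

print(json_failed, same_values, same_dtypes)
```

pandas does offer DataFrame.to_json for individual frames, but each frame then has to be converted separately and some dtype information may not survive the round trip.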

René · Accepted Answer · 2019-12-17T07:21:23.903

4

I would suggest using pickle when you prefer a compact file format.

# import packages
import pandas as pd
import numpy as np
import pickle
import os

# create dictionary of dataframes
nrows, ncols, ndataframes = 1_000, 50, 100
my_dict = {f'df_{n}': pd.DataFrame(np.random.rand(nrows, ncols)) for n in range(ndataframes)}

# save dictionary as pickle file
pickle_out = open('my_dict.pickle', 'wb')
pickle.dump(my_dict, pickle_out)
pickle_out.close()

# create new dictionary from pickle file
pickle_in = open('my_dict.pickle', 'rb')
new_dict = pickle.load(pickle_in)
pickle_in.close()

# print file size
print('File size pickle file is', round(os.path.getsize('my_dict.pickle') / (1024**2), 1), 'MB')

# sample
new_dict['df_10'].iloc[:5, :5]

Result:

File size pickle file is 38.2 MB

          0         1         2         3         4
0  0.338838  0.501158  0.406240  0.693233  0.567305
1  0.092142  0.569312  0.952694  0.083705  0.006950
2  0.684314  0.373091  0.550300  0.391419  0.877889
3  0.117929  0.597653  0.726894  0.763094  0.466603
4  0.530755  0.472033  0.553457  0.863435  0.906389
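
As a rough sanity check on the reported file size (a back-of-the-envelope sketch, assuming every value is stored as an 8-byte float64, which is what np.random.rand produces), the raw numeric payload alone accounts for nearly the whole file:

```python
# Dimensions taken from the example above
nrows, ncols, ndataframes = 1_000, 50, 100
bytes_per_value = 8  # size of one float64

raw_bytes = nrows * ncols * ndataframes * bytes_per_value
raw_mb = raw_bytes / (1024 ** 2)

print(f"Raw float64 payload: {raw_mb:.1f} MB")  # Raw float64 payload: 38.1 MB
```

The small gap to the reported 38.2 MB is presumably index, column, and pickle metadata, so pickle adds very little overhead for purely numeric frames like these.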

edited Dec 17 '19 at 07:21

answered Dec 17 '19 at 07:11

René


Thanks! I am testing on a small dictionary, but `new_dict = pickle.load(pickle_in)` gives EOFError: Ran out of input – haneulkim Dec 17 '19 at 07:35
Can you check whether your pickle file was written to disk successfully (and has a file size > 0 bytes)? – René Dec 17 '19 at 07:59
If you can share your code, I will have a look. The code in my answer runs without problems on my machine using Python 3.6.9 – René Dec 17 '19 at 08:00
I've just created two dataframes with 3 rows each. Even though the dictionary holds two 3x5 dataframes, the file is 0 KB. – haneulkim Dec 17 '19 at 08:07
Can you add your code to your question? I suspect that reading the pickle fails because saving the pickle file was not successful. – René Dec 17 '19 at 08:32
I just forgot the () after pickle_out.close. Thanks for your help. – haneulkim Dec 17 '19 at 08:49
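
The mistake found in the comments can be reproduced in a few lines. Without the parentheses, pickle_out.close only names the method and never calls it, so a small pickle stays in the write buffer and the file on disk is still empty when it is read back. This is a minimal sketch that uses a plain dict as a stand-in for the dictionary of dataframes and a temporary directory (both assumptions for illustration); it relies on the payload being smaller than the write buffer.

```python
import os
import pickle
import tempfile

data = {"key1": [1, 2, 3], "key2": [4, 5, 6]}  # stand-in for small dataframes

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "my_dict.pickle")

    pickle_out = open(path, "wb")
    pickle.dump(data, pickle_out)
    pickle_out.close  # bug: missing (), so the file is never closed or flushed

    size_before_close = os.path.getsize(path)  # data is still buffered in memory
    try:
        with open(path, "rb") as pickle_in:
            pickle.load(pickle_in)
        load_failed = False
    except EOFError:
        load_failed = True

    pickle_out.close()  # actually close the file, which flushes the buffer

    # A with block closes the file automatically, so close() cannot be forgotten
    with open(path, "wb") as pickle_out:
        pickle.dump(data, pickle_out)
    with open(path, "rb") as pickle_in:
        restored = pickle.load(pickle_in)

print(size_before_close, load_failed, restored == data)
```

Opening files in a with block is the usual way to rule out this class of error, since the file is closed when the block ends even if an exception is raised.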

score 0 · Answer 2 · answered Dec 17 '19 at 08:32

0

Another alternative could be HDFStore, a dict-like object that reads and writes pandas objects using the high-performance HDF5 format. More details here: http://pandas-docs.github.io/pandas-docs-travis/user_guide/io.html#hdf5-pytables
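
A minimal sketch of how a dictionary of dataframes might be written to and read back from an HDFStore, assuming pandas and the optional PyTables package (tables) are installed. The helper names save_frames and load_frames are hypothetical, chosen for illustration.

```python
import pandas as pd


def save_frames(frames, path):
    """Write each dataframe in the dict under its own key in one HDF5 file."""
    with pd.HDFStore(path, mode="w") as store:
        for key, df in frames.items():
            store[key] = df


def load_frames(path):
    """Read every dataframe back into a dict keyed by the stored names."""
    with pd.HDFStore(path, mode="r") as store:
        # store.keys() returns names with a leading '/', e.g. '/key1'
        return {key.lstrip("/"): store[key] for key in store.keys()}
```

For example, save_frames({"key1": df1, "key2": df2}, "my_dict.h5") followed by load_frames("my_dict.h5") should return a dictionary with the same keys. Unlike a single pickle, this layout also allows loading one dataframe at a time by key.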

answered Dec 17 '19 at 08:32

dejanualex

