5

I have a 51K x 8.5K data frame containing only binary (1 or 0) values.

I wrote the following code:

# Pickling the data to the disk
import pickle

outfile = open("df_preference.p", "wb")
pickle.dump(df_preference, outfile)
outfile.close()

It throws a MemoryError, as shown below:

MemoryError                               Traceback (most recent call last)
<ipython-input-48-de66e880aacb> in <module>()
      2 
      3 outfile=open("df_preference.p", "wb")
----> 4 pickle.dump(df_preference,outfile)
      5 outfile.close()

I assume this means the data is too large to be pickled? But it only contains binary values.

Before this, I created this dataset from another data frame that had regular counts and a lot of zeros, using the following code:

df_preference = df_recommender.applymap(lambda x: np.where(x > 0, 1, 0))

Even creating df_preference this way took a while, for a matrix of the same size.

My concern is: (i) if it takes this long just to create the data frame with applymap, and (ii) the data frame can't even be pickled due to a memory error, then the matrix factorization of df_preference via SVD and Alternating Least Squares that I need to do next will be even slower. How can I tackle the slow run and solve the memory error?

Thanks

Baktaawar

2 Answers

4

UPDATE:

For 1 and 0 values you can use the int8 (1-byte) dtype, which will reduce your memory usage by at least 4x compared to the default integer dtype (int32 uses 4 bytes per value, int64 uses 8).

(df_recommender > 0).astype(np.int8).to_pickle('/path/to/file.pickle')

Here is an example with a 51K x 9K data frame:

In [1]: df = pd.DataFrame(np.random.randint(0, 10, size=(51000, 9000)))

In [2]: df.shape
Out[2]: (51000, 9000)

In [3]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51000 entries, 0 to 50999
Columns: 9000 entries, 0 to 8999
dtypes: int32(9000)
memory usage: 1.7 GB

The source DF needs 1.7 GB of memory.

In [6]: df_preference = (df>0).astype(int)

In [7]: df_preference.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51000 entries, 0 to 50999
Columns: 9000 entries, 0 to 8999
dtypes: int32(9000)
memory usage: 1.7 GB

The resulting DF with the default integer dtype again needs 1.7 GB in memory.

In [4]: df_preference = (df>0).astype(np.int8)

In [5]: df_preference.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51000 entries, 0 to 50999
Columns: 9000 entries, 0 to 8999
dtypes: int8(9000)
memory usage: 437.7 MB

With the int8 dtype it takes only 438 MB.
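Side note: since the matrix is mostly zeros, a sparse representation shrinks it even further and can be fed straight into a truncated SVD, which is relevant for the factorization step mentioned in the question. A minimal sketch, assuming SciPy is available (the factor count k=50 is an arbitrary illustration):

import numpy as np
from scipy import sparse
from scipy.sparse.linalg import svds

# a mostly-zero int8 matrix compresses well in CSR form
m = sparse.csr_matrix(df_preference.values)

# truncated SVD with k latent factors, computed without
# ever materializing a dense copy of the matrix
u, s, vt = svds(m.astype(np.float32), k=50)

Most ALS implementations likewise operate on sparse inputs.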

Now let's save it as a pickle file:

In [10]: df_preference.to_pickle('d:/temp/df_pref.pickle')

file size:

{ temp }  » ls -lh df_pref.pickle
-rw-r--r-- 1 Max None 438M May 28 09:20 df_pref.pickle
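Reading it back is the mirror operation; a small sketch using the same path as above:

import pandas as pd
import numpy as np

# pd.read_pickle is the inverse of DataFrame.to_pickle
df_restored = pd.read_pickle('d:/temp/df_pref.pickle')

# the int8 dtype round-trips, so memory stays at ~438 MB
assert (df_restored.dtypes == np.int8).all()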

OLD answer:

Try this instead - the vectorized comparison (df_recommender > 0) is done in C and avoids the slow per-element applymap call:

(df_recommender > 0).astype(int).to_pickle('/path/to/file.pickle')

Explanation:

In [200]: df
Out[200]:
   a  b  c
0  4  3  3
1  1  2  1
2  2  1  0
3  2  0  1
4  2  0  4

In [201]: (df>0).astype(int)
Out[201]:
   a  b  c
0  1  1  1
1  1  1  1
2  1  1  0
3  1  0  1
4  1  0  1

P.S. You may also want to save your DF as an HDF5 file instead of a pickle - see this comparison for details.
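For reference, a minimal sketch of the HDF5 round trip (it needs the PyTables package; the key name 'df_preference' is an arbitrary choice):

# requires PyTables: pip install tables
df_preference.to_hdf('df_pref.h5', key='df_preference', mode='w')

# read it back
df_restored = pd.read_hdf('df_pref.h5', key='df_preference')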

MaxU - stand with Ukraine
  • Quick question: this won't change the existing df data frame, right? Second, to load the data, is there a similar function, or do I do the outfile-and-pickle thing again? – Baktaawar May 27 '16 at 22:52
  • @Baktaawar, it won't change your DF. You can use `pd.read_pickle` to read the saved DF back - see the comparison link I provided in my answer for examples. – MaxU - stand with Ukraine May 27 '16 at 22:55
  • OK, so your command worked and pickled it, but it created a 1.8 GB file. Now if I have to do matrix factorization on this data it will take time. Any suggestions to improve that? – Baktaawar May 27 '16 at 22:57
  • `print(df_recommender.info())` - will show you the memory usage. Open a new question about your "matrix factorization" with sample data and desired result set - it'll help to understand what you are going to achieve. – MaxU - stand with Ukraine May 27 '16 at 23:01
  • Quick question: I did df_preference.to_pickle('df_preference.pickle') and it throws an error. Why is that? What if I have already created the binary matrix and don't need to do (df>0).astype(int)? It doesn't seem to work without that. – Baktaawar May 27 '16 at 23:05
0

I got a memory error when saving a DataFrame of approximately 8.5 GB to pickle; the cause was a shortage of RAM. This was all in a Jupyter Notebook with Python 3.7.6.

I tried df.to_pickle() with default parameters and df.to_hdf(..., mode="w").

Both gave me a MemoryError, as the process allocates additional memory when saving to these formats (HDF also apparently uses pickle internally).

I finally succeeded by saving to CSV with df.to_csv(), as it does not use significant additional memory.
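A minimal sketch of that approach (the chunksize value here is just an illustration; to_csv streams the rows out incrementally instead of building one large serialized object in memory):

# stream the frame to disk in row chunks, keeping extra memory use small
df.to_csv('df_backup.csv', chunksize=100000)

# read it back later; a huge CSV can itself be read piecewise
# by passing chunksize to pd.read_csv
df_restored = pd.read_csv('df_backup.csv', index_col=0)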

Here is the df.info() output for the DataFrame I was dealing with:

<class 'pandas.core.frame.DataFrame'>
Index: 141516896 entries, 1eedd4a85d23 to 1c0088d397a3
Data columns (total 7 columns):
 #   Column         Dtype  
---  ------         -----  
 0   id             object 
 1   value          float64
 2   value2         object 
 3   value3         object 
 4   value4         object 
 5   value5         object 
 6   value6         object 
dtypes: float64(1), object(6)
memory usage: 8.4+ GB

The resulting file is around 15 GB, but at least I did not lose my data.

Hope that helps someone.

anatoly