5

I have a 51K x 8.5K data frame containing only binary (1 or 0) values.

I wrote the following code:

# Pickling the data to the disk
import pickle

outfile = open("df_preference.p", "wb")
pickle.dump(df_preference, outfile)
outfile.close()

It throws a MemoryError, as shown below:

MemoryError                               Traceback (most recent call last)
<ipython-input-48-de66e880aacb> in <module>()
      2 
      3 outfile=open("df_preference.p", "wb")
----> 4 pickle.dump(df_preference,outfile)
      5 outfile.close()

I assume this means the data is too large to be pickled? But it only contains binary values.

Before this, I created this dataset from another data frame that had regular counts and a lot of zeros, using the following code:

df_preference = df_recommender.applymap(lambda x: np.where(x > 0, 1, 0))

Even creating df_preference this way took a while, for a matrix of the same size.

My concern is: (i) if it takes this long just to create the data frame with applymap, and (ii) the data frame can't even be pickled due to a memory error, then the matrix factorization of df_preference via SVD and Alternating Least Squares that I need to do next will be even slower. How can I tackle the slow run and solve the memory error?

Thanks

Baktaawar

2 Answers

4

UPDATE:

For 1 and 0 values you can use the int8 (1-byte) dtype, which will reduce your memory usage by at least 4x compared to the default integer dtype (int32 uses 4 bytes per value, int64 uses 8).

(df_recommender > 0).astype(np.int8).to_pickle('/path/to/file.pickle')

Here is an example with a 51K x 9K data frame:

In [1]: df = pd.DataFrame(np.random.randint(0, 10, size=(51000, 9000)))

In [2]: df.shape
Out[2]: (51000, 9000)

In [3]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51000 entries, 0 to 50999
Columns: 9000 entries, 0 to 8999
dtypes: int32(9000)
memory usage: 1.7 GB

The source DF needs 1.7 GB of memory.

In [6]: df_preference = (df>0).astype(int)

In [7]: df_preference.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51000 entries, 0 to 50999
Columns: 9000 entries, 0 to 8999
dtypes: int32(9000)
memory usage: 1.7 GB

The resulting DF with the default integer dtype again needs 1.7 GB in memory.

In [4]: df_preference = (df>0).astype(np.int8)

In [5]: df_preference.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51000 entries, 0 to 50999
Columns: 9000 entries, 0 to 8999
dtypes: int8(9000)
memory usage: 437.7 MB

With the int8 dtype it takes only 438 MB.
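Side note: since the matrix is mostly zeros, a sparse representation shrinks it even further and can be fed straight into a truncated SVD, which is relevant for the factorization step mentioned in the question. A minimal sketch, assuming SciPy is available (the factor count k=50 is an arbitrary illustration):

import numpy as np
from scipy import sparse
from scipy.sparse.linalg import svds

# a mostly-zero int8 matrix compresses well in CSR form
m = sparse.csr_matrix(df_preference.values)

# truncated SVD with k latent factors, computed without
# ever materializing a dense copy of the matrix
u, s, vt = svds(m.astype(np.float32), k=50)

Most ALS implementations likewise operate on sparse inputs.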

Now let's save it as a pickle file:

In [10]: df_preference.to_pickle('d:/temp/df_pref.pickle')

file size:

{ temp }  » ls -lh df_pref.pickle
-rw-r--r-- 1 Max None 438M May 28 09:20 df_pref.pickle
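Reading it back is the mirror operation; a small sketch using the same path as above:

import pandas as pd
import numpy as np

# pd.read_pickle is the inverse of DataFrame.to_pickle
df_restored = pd.read_pickle('d:/temp/df_pref.pickle')

# the int8 dtype round-trips, so memory stays at ~438 MB
assert (df_restored.dtypes == np.int8).all()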

OLD answer:

Try this instead - the vectorized comparison (df_recommender > 0) is done in C and avoids the slow per-element applymap call:

(df_recommender > 0).astype(int).to_pickle('/path/to/file.pickle')

Explanation:

In [200]: df
Out[200]:
   a  b  c
0  4  3  3
1  1  2  1
2  2  1  0
3  2  0  1
4  2  0  4

In [201]: (df>0).astype(int)
Out[201]:
   a  b  c
0  1  1  1
1  1  1  1
2  1  1  0
3  1  0  1
4  1  0  1

P.S. You may also want to save your DF as an HDF5 file instead of a pickle - see this comparison for details.
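For reference, a minimal sketch of the HDF5 round trip (it needs the PyTables package; the key name 'df_preference' is an arbitrary choice):

# requires PyTables: pip install tables
df_preference.to_hdf('df_pref.h5', key='df_preference', mode='w')

# read it back
df_restored = pd.read_hdf('df_pref.h5', key='df_preference')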

MaxU - stand with Ukraine
  • Quick question: this won't change the existing df data frame, right? Second, to load the data, is there a similar function, or do I do the outfile-and-pickle thing again? – Baktaawar May 27 '16 at 22:52
  • @Baktaawar, it won't change your DF. You can use `pd.read_pickle` to read the saved DF back - see the comparison link I provided in my answer for examples. – MaxU - stand with Ukraine May 27 '16 at 22:55
  • OK, so your command worked and pickled it, but it created a 1.8 GB file. Now if I have to do matrix factorization on this data it will take time. Any suggestions to improve that? – Baktaawar May 27 '16 at 22:57
  • `print(df_recommender.info())` - will show you the memory usage. Open a new question about your "matrix factorization" with sample data and desired result set - it'll help to understand what you are going to achieve. – MaxU - stand with Ukraine May 27 '16 at 23:01
  • Quick question: I did df_preference.to_pickle('df_preference.pickle') and it throws an error. Why is that? What if I have already created the binary matrix and don't need to do (df>0).astype(int)? It doesn't seem to work without that. – Baktaawar May 27 '16 at 23:05
0

I got a memory error when saving a DataFrame of approximately 8.5 GB to pickle; the cause was a shortage of RAM. This was all in a Jupyter Notebook with Python 3.7.6.

I tried df.to_pickle() with default parameters and df.to_hdf(..., mode="w").

Both gave me a MemoryError, as the process allocates additional memory when saving to these formats (HDF also apparently uses pickle internally).

I finally succeeded by saving to CSV with df.to_csv(), as it does not use significant additional memory.
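A minimal sketch of that approach (the chunksize value here is just an illustration; to_csv streams the rows out incrementally instead of building one large serialized object in memory):

# stream the frame to disk in row chunks, keeping extra memory use small
df.to_csv('df_backup.csv', chunksize=100000)

# read it back later; a huge CSV can itself be read piecewise
# by passing chunksize to pd.read_csv
df_restored = pd.read_csv('df_backup.csv', index_col=0)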

Here is the df.info() output for the DataFrame I was dealing with:

<class 'pandas.core.frame.DataFrame'>
Index: 141516896 entries, 1eedd4a85d23 to 1c0088d397a3
Data columns (total 7 columns):
 #   Column         Dtype  
---  ------         -----  
 0   id             object 
 1   value          float64
 2   value2         object 
 3   value3         object 
 4   value4         object 
 5   value5         object 
 6   value6         object 
dtypes: float64(1), object(6)
memory usage: 8.4+ GB

The resulting file is around 15 GB, but at least I did not lose my data.

Hope that helps someone.

anatoly