
I am trying to store a scipy.sparse.csr.csr_matrix of shape (1482535, 67826) in a DataFrame, but I am getting the error below. I am running on Google Cloud Platform with 4 CPUs and 208 GB of memory, and I can't increase the memory any further. How can I solve this issue? Any suggestions are appreciated.

    type(x_train_bow_name)
    scipy.sparse.csr.csr_matrix

    data1 = pd.DataFrame(x_train_bow_name.toarray())

    ---------------------------------------------------------------------------
    MemoryError                               Traceback (most recent call last)
    <ipython-input-16-283fa4dd2dd6> in <module>
    ----> 1 data1 = pd.DataFrame(x_train_bow_name.toarray())

    /usr/local/lib/python3.5/dist-packages/scipy/sparse/compressed.py in toarray(self, order, out)
       1022         if out is None and order is None:
       1023             order = self._swap('cf')[0]
    -> 1024         out = self._process_toarray_args(order, out)
       1025         if not (out.flags.c_contiguous or out.flags.f_contiguous):
       1026             raise ValueError('Output array must be C or F contiguous')

    /usr/local/lib/python3.5/dist-packages/scipy/sparse/base.py in _process_toarray_args(self, order, out)
       1184             return out
       1185         else:
    -> 1186             return np.zeros(self.shape, dtype=self.dtype, order=order)
       1187
       1188

    MemoryError: Unable to allocate array with shape (1482535, 67826) and data type int64
Nibrass H
  • Are you running a 64-bit Python? – Paul Nov 05 '19 at 10:51
  • This seems to be a memory overcommitment issue. Check [this other question](https://stackoverflow.com/questions/57507832/unable-to-allocate-array-with-shape-and-data-type), where the same issue was solved for numpy. Let me know if this helps. – sotis Nov 07 '19 at 09:43

2 Answers


Everything points to this being a memory overcommitment issue.
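As a quick sanity check (a sketch using only the shape and dtype reported in your traceback), the dense array being requested is far larger than the machine's 208 GB of RAM:

    import numpy as np

    # Shape and dtype taken from the MemoryError in the traceback
    rows, cols = 1482535, 67826
    bytes_needed = rows * cols * np.dtype(np.int64).itemsize

    print(f"{bytes_needed / 1024**3:.0f} GiB")  # ~749 GiB requested for the dense array

Since np.zeros can hand back untouched zero pages from the kernel, only the pages that toarray() actually writes nonzeros into need physical memory, which is exactly the "sparse arrays" case the overcommit setting below is meant for.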

Please take a look at the following link, where the virtual memory settings in Linux and the problem with overcommit are explained very well.

The two parameters to modify the overcommit settings are /proc/sys/vm/overcommit_memory and /proc/sys/vm/overcommit_ratio.

/proc/sys/vm/overcommit_memory This switch accepts 3 different settings, but in your case I think you could use the value 1:

1: Classic example is code using sparse arrays and just relying on the virtual memory consisting almost entirely of zero pages.

The setting can be changed by a superuser:

    echo 1 > /proc/sys/vm/overcommit_memory

The default value is 0.
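If you want to verify the current mode from inside your notebook, a minimal sketch (assuming Linux; reading works as any user, writing requires root):

    # Read the current overcommit mode; 0 = heuristic (default), 1 = always allow
    with open("/proc/sys/vm/overcommit_memory") as f:
        print("overcommit_memory =", f.read().strip())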

/proc/sys/vm/overcommit_ratio This setting is only used when overcommit_memory = 2, and it defines what percentage of the physical RAM may be committed; swap space goes on top of that. The default is 50, i.e. 50%. For example, with 208 GB of RAM, no swap, and the default ratio, the commit limit under overcommit_memory = 2 would be about 104 GB.

The setting can be changed by a superuser:

    echo 75 > /proc/sys/vm/overcommit_ratio

Also, please take a look at the following post.

Jose Luis Delgadillo

The way to read sparse scipy matrices into pandas is:

    pd.DataFrame.sparse.from_spmatrix(my_sparse_matrix)

This creates a DataFrame with sparse dtypes. Sparse dtypes are of the form pd.SparseDtype("float", np.nan).

By calling my_sparse_matrix.toarray() you are undoing its sparsity and materializing every zero in memory.
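For illustration, a small self-contained sketch (the matrix here is a hypothetical stand-in for x_train_bow_name, just to show the API):

    import numpy as np
    import pandas as pd
    from scipy import sparse

    # Hypothetical stand-in for x_train_bow_name
    mat = sparse.random(1000, 500, density=0.01, format="csr", dtype=np.float64)

    df = pd.DataFrame.sparse.from_spmatrix(mat)
    print(df.dtypes[0])        # Sparse[float64, 0]
    print(df.sparse.density)   # ~0.01 -- memory stays proportional to the nonzeros

Note that from_spmatrix requires a reasonably recent pandas (it was added in 0.25).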

LeoArtaza