
I have a dataframe with 78,000,000 rows and 14 columns. I want to turn it into a sparse matrix for model training. To do this, I one-hot encode it with pd.get_dummies, which gives me a dataframe of 78,000,000 rows x 1100 columns. Next, I create a lil_matrix and try to fill it, but I run out of memory. I have 32 GB of RAM.

How can I do this? Here is my code for converting the dataframe to a sparse matrix:

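import numpy as np
from scipy.sparse import lil_matrix

# df is the one-hot encoded dataframe (78,000,000 rows x 1100 columns)
my_arr = lil_matrix(df.shape, dtype=np.uint8)
for i, column in enumerate(df.columns):
    # Row positions where this dummy column is non-zero
    rows = np.flatnonzero(df[column].to_numpy() != 0)
    my_arr[rows, i] = 1

# tocsr() returns a new matrix, so keep the result
my_arr = my_arr.tocsr()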
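
For comparison, here is a minimal sketch of one way to skip both the LIL fill loop and the 1100-column intermediate dataframe: factorize each original column and build its one-hot block directly as a sparse matrix, then stack the blocks. The helper name one_hot_csr and the toy columns are hypothetical, and NaN values get no indicator column in this sketch. It is an illustration under those assumptions, not a tested solution for the full 78,000,000-row frame.

```python
import numpy as np
import pandas as pd
from scipy import sparse


def one_hot_csr(frame):
    """One-hot encode every column of `frame` into one CSR matrix (hypothetical helper)."""
    n_rows = len(frame)
    row_idx = np.arange(n_rows)
    blocks, labels = [], []
    for name in frame.columns:
        # codes[i] is the category index of row i; NaN is coded as -1
        codes, uniques = pd.factorize(frame[name])
        keep = codes >= 0
        ones = np.ones(int(keep.sum()), dtype=np.uint8)
        # One block per source column: a single 1 per non-NaN row
        block = sparse.csr_matrix(
            (ones, (row_idx[keep], codes[keep])),
            shape=(n_rows, len(uniques)),
        )
        blocks.append(block)
        labels.extend(f"{name}_{value}" for value in uniques)
    return sparse.hstack(blocks, format="csr"), labels


# Toy data standing in for the real dataframe
toy = pd.DataFrame({"color": ["r", "g", "r"], "size": ["S", "S", "L"]})
matrix, labels = one_hot_csr(toy)
print(labels)
print(matrix.toarray())
```

Only the integer codes and one small block per column are held in memory at a time, which is why this shape of solution tends to be lighter than filling a lil_matrix cell by cell.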

Update: scipy.sparse.csr_matrix(df.values) does not solve my problem, because df.values builds a dense array that takes up too much memory.
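
A quick back-of-the-envelope calculation shows why the dense route cannot fit in 32 GB. The row and column counts come from the question. The sparse estimate rests on assumptions that are not stated there: all 14 source columns are one-hot encoded with exactly one non-zero per row and column, values are stored as uint8, and the CSR index arrays use 32-bit integers.

```python
n_rows = 78_000_000
n_dummy_cols = 1_100
n_source_cols = 14

# Dense array with one byte (uint8) per cell
dense_bytes = n_rows * n_dummy_cols
dense_gib = dense_bytes / 2**30

# Assumption: one non-zero per row for each of the 14 source columns
nnz = n_rows * n_source_cols
# CSR storage: uint8 data + int32 column indices + int32 row pointers
csr_bytes = nnz * 1 + nnz * 4 + (n_rows + 1) * 4
csr_gib = csr_bytes / 2**30

print(f"dense: {dense_gib:.1f} GiB, CSR (estimated): {csr_gib:.1f} GiB")
```

Under these assumptions the dense array needs roughly 80 GiB even at one byte per cell, while the CSR form comes to about 5.4 GiB.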

Update2: I can't add a traceback, because the kernel restarts as soon as memory usage reaches 32 GB. All I can say is that the conversion uses a lot of memory and there is not enough RAM.

Update3: The user CJR gave a great hint. To convert a DataFrame to a sparse matrix, just do this:

Data_frame_csr = pd.get_dummies(Data_frame, columns=[name1, name2 ..., nameN], dummy_na=True, sparse=True).sparse.to_coo().tocsr()

dummy_na=True adds an indicator column for NaN values (see the pandas documentation).
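
As a small illustration of that one-liner, the sketch below runs it on a toy frame. The frame and its column names are made up, and dtype=np.uint8 is an addition not present in the original line, chosen here so the stored values are one-byte integers. Note that the .sparse accessor on a DataFrame only works when every column is sparse, so all columns of the toy frame are encoded.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real dataframe; one NaN to show dummy_na=True
toy = pd.DataFrame({"color": ["r", "g", np.nan], "size": ["S", "S", "L"]})

dummies = pd.get_dummies(
    toy,
    columns=["color", "size"],  # encode every column so all results are sparse
    dummy_na=True,              # extra indicator column per source column for NaN
    sparse=True,
    dtype=np.uint8,
)
toy_csr = dummies.sparse.to_coo().tocsr()

print(list(dummies.columns))
print(toy_csr.shape, toy_csr.nnz)
```

Each source column contributes one indicator per category plus one NaN indicator, so every row ends up with exactly one non-zero per source column.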

Adele
  • Try to look at https://stackoverflow.com/questions/20459536/convert-pandas-dataframe-to-sparse-numpy-matrix-directly – maria Jan 22 '21 at 19:37
  • Does this answer your question? [Convert Pandas dataframe to Sparse Numpy Matrix directly](https://stackoverflow.com/questions/20459536/convert-pandas-dataframe-to-sparse-numpy-matrix-directly) – Chris Jan 22 '21 at 19:37
  • @maria and Chris thanks for the answer, but it doesn't work. From the comments to that answer: "But this is suppose to generate a memory copy, isn't it? As df.values is essentially returning a dense matrix, and cast to csr_matrix handle. It doesn't work for large matrix." – Adele Jan 22 '21 at 20:13
  • Show the traceback so we can better see where the error occurs – hpaulj Jan 22 '21 at 20:15
  • @hpaulj Thanks for the comment, answered from above, update2! – Adele Jan 22 '21 at 20:50
  • pd.get_dummies(..., sparse=True) doesn't do what you want? – CJR Jan 22 '21 at 23:48
  • @CJR Thanks for the comment. I just use pd.get_dummies(..., sparse=True) to turn a regular dataframe into a dataframe suitable for training, but it doesn't fit when I train with model.fit. – Adele Jan 23 '21 at 10:54
  • pd.get_dummies(...,sparse=True).sparse.to_coo().tocsr() should give you a csr matrix no problem. – CJR Jan 23 '21 at 14:42
  • @CJR you are a genius! Yes it helped to create a sparse matrix, thanks! But there is still not enough RAM to train the decision tree model. – Adele Jan 24 '21 at 14:10

0 Answers