I have a DataFrame of 78,000,000 rows x 14 columns and I want to build a sparse matrix from it for model training. To do the conversion I use pd.get_dummies, which gives me a DataFrame of 78,000,000 rows x 1100 columns. Next I create a lil_matrix and try to fill it, but I run out of memory. I have 32 GB of RAM.
Please tell me how I can do this. Here is my code for converting the DataFrame to a sparse matrix:
import numpy as np
from scipy.sparse import lil_matrix

my_arr = lil_matrix(df.shape, dtype=np.uint8)
for i, column in enumerate(df.columns):
    inx = df[column] != 0
    my_arr[np.where(inx)[0], i] = 1  # set 1 wherever the column is nonzero
my_arr = my_arr.tocsr()  # tocsr() returns a new matrix, so assign the result
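For reference, a construction that avoids lil_matrix entirely can be much leaner on memory: collect the row/column indices of the nonzeros per column and build a coo_matrix directly, since LIL keeps a Python list per row. This is a sketch on a small toy frame (the data and column names are made up for illustration):

```python
import numpy as np
import pandas as pd
from scipy.sparse import coo_matrix

# Toy stand-in for the real 78M-row frame.
df = pd.DataFrame({"a": [0, 1, 0, 2], "b": [3, 0, 0, 1]})

rows, cols = [], []
for i, column in enumerate(df.columns):
    nz = np.flatnonzero(df[column].to_numpy())  # row indices of nonzeros
    rows.append(nz)
    cols.append(np.full(nz.size, i, dtype=np.int64))

# One pass to COO, then convert to CSR; only the nonzeros are stored.
my_arr = coo_matrix(
    (np.ones(sum(len(r) for r in rows), dtype=np.uint8),
     (np.concatenate(rows), np.concatenate(cols))),
    shape=df.shape,
).tocsr()
```

The index arrays still cost memory proportional to the number of nonzeros, so this helps only when the data is genuinely sparse.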
Update: scipy.sparse.csr_matrix(df.values) is not an option, because df.values materializes a dense array that itself takes up a lot of memory, so it doesn't solve my problem.
Update 2: I can't add a traceback, because when memory usage reaches 32 GB the kernel restarts. I can only say that it consumes all available RAM.
Update 3: A user with the nickname CJR gave a great hint. To convert a DataFrame to a sparse matrix, it is enough to do:
Data_frame_csr = pd.get_dummies(Data_frame, columns=[name1, name2 ..., nameN], dummy_na=True, sparse=True).sparse.to_coo().tocsr()
dummy_na=True also creates an indicator column for NaN values (see the documentation).
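To make the hint concrete, here is a minimal runnable sketch on toy data (the frame and column names are illustrative, not the real ones). With sparse=True the dummy columns are kept as pandas SparseArrays, so the dense 78M x 1100 intermediate is never materialized; note that .sparse.to_coo() requires every column of the frame to be sparse, which holds here because all columns are dummified:

```python
import pandas as pd

# Toy stand-in for the real frame; None models missing values.
df = pd.DataFrame({"name1": ["a", "b", None, "a"],
                   "name2": ["x", None, "x", "y"]})

csr = (
    pd.get_dummies(df, columns=["name1", "name2"], dummy_na=True, sparse=True)
    .sparse.to_coo()  # pandas sparse frame -> scipy COO matrix
    .tocsr()          # CSR is the usual format for training libraries
)
```

Each original column contributes one set of indicator columns (including one for NaN), so every row has exactly one nonzero per dummified column.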