I'm trying to fit a regression on a dataset that has over 40,000 variables once I add dummy variables, so I believe I need to feed the model a sparse matrix. However, I can't convert my pandas DataFrame into a matrix using the code found here:
Convert Pandas dataframe to Sparse Numpy Matrix directly
This is because the dataset is too large, and I run into a memory error. The following reproduces the issue:
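import numpy as np
import pandas as pd
# 1,000,000 rows of random integers in [0, 40000) across four columns
df = pd.DataFrame(np.random.randint(0, 40000, size=(1000000, 4)), columns=list('ABCD'))
# D has up to 40,000 distinct values, so this creates roughly that many dummy columns
df = pd.get_dummies(df, columns=['D'], sparse=True, drop_first=True)
# convert the DataFrame to a NumPy array
df = df.values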
I'd ultimately like to convert the full DataFrame (3 million records with 49,000 columns) into a matrix, because I suspect I can then create a sparse matrix and use it for my regression. This works quite well on a smaller subset, but I need to test on the entire dataset. The example above raises a MemoryError right away, so I suspect it's some Python limitation, but I'm hoping there is a workaround.
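For reference, here is a rough sketch of the kind of workaround I have in mind, not something I have verified at full scale. It assumes SciPy is available and builds the dummy columns directly as a scipy.sparse matrix from integer category codes, so no dense array is ever created. The row count is reduced from my example so it runs quickly, and the names codes, dummies, numeric, and X are just illustrative.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Smaller than the example in the question so the sketch runs quickly.
n_rows, n_levels = 100_000, 40_000
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, n_levels, size=(n_rows, 4)), columns=list("ABCD"))

# Encode D as integer category codes (0 .. k-1) instead of dummy columns.
cat = pd.Categorical(df["D"])
codes = cat.codes
k = len(cat.categories)

# One nonzero per row: row i gets a 1 in column codes[i].
dummies = sparse.csr_matrix(
    (np.ones(n_rows, dtype=np.int8), (np.arange(n_rows), codes)),
    shape=(n_rows, k),
)
# Mimic drop_first=True by dropping the first level's column.
dummies = dummies[:, 1:]

# Put the remaining numeric columns next to the dummies, still sparse.
numeric = sparse.csr_matrix(df[["A", "B", "C"]].to_numpy())
X = sparse.hstack([numeric, dummies], format="csr")

# Size a dense float64 array of the same shape would need, in GB.
dense_gb = X.shape[0] * X.shape[1] * 8 / 1e9
print(X.shape, X.nnz, dense_gb)
```

If the regression library accepts SciPy sparse matrices (many scikit-learn linear models do, for example), X could be passed to it directly. The dense_gb figure is only there to show the gap between the dense and sparse representations, which is presumably what triggers the MemoryError.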