
I'm trying to use a sparse matrix in my regression, since there are over 40,000 variables after I add dummy variables. To do this, I believe I need to feed the model a sparse matrix. However, I can't convert my pandas dataframe into a matrix using the code found here:

Convert Pandas dataframe to Sparse Numpy Matrix directly

This is because the dataset is too large and I run into a memory error. The issue can be replicated by running the following:

import numpy as np
import pandas as pd

# 1 million rows, 4 integer columns with values in [0, 40000)
df = pd.DataFrame(np.random.randint(0, 40000, size=(1000000, 4)), columns=list('ABCD'))
# one-hot encode column D into ~40,000 sparse dummy columns
df = pd.get_dummies(df, columns=['D'], sparse=True, drop_first=True)
# .values materializes the whole frame as a dense array -- this is where the MemoryError occurs
df = df.values

I'd ultimately like to convert the dataframe (3 million records with 49,000 columns) into a sparse matrix that I can use for my regression. This works quite well on a smaller subset, but I need to test the entire dataset. The example above raises a MemoryError right away, so I suspect it's some Python limitation, but I am hoping there is a workaround.
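For reference, pandas 0.25+ has a .sparse accessor that can convert an all-sparse frame into a SciPy COO matrix without densifying anything. A minimal sketch of that idea, assuming only the dummy columns are sparse and the dense columns get stacked back on afterwards (using df from before the df.values line):

import scipy.sparse

# get_dummies(sparse=True) makes only the dummy columns sparse, and
# DataFrame.sparse.to_coo() requires every column to be sparse, so
# convert the dummy block on its own
dummy_cols = [c for c in df.columns if c.startswith('D_')]
dummies = df[dummy_cols].sparse.to_coo()  # SciPy COO, no dense copy
dense_part = scipy.sparse.csr_matrix(df[['A', 'B', 'C']].values)
X = scipy.sparse.hstack([dense_part, dummies]).tocsr()
print(X.shape)
# (1000000, 3 + number of dummy columns)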

sqlnewbie1979

2 Answers


Building a large sparse matrix is a costly operation. With SciPy it can be difficult to create one this large, and your system memory might not support it.

I suggest using the Spark libraries, so that your dataset is distributed across a cluster (as an RDD). Below is sample code:

from pyspark.mllib.linalg import Vectors

# length-3 vector with 1.0 at index 0 and 3.0 at index 2
sparse = Vectors.sparse(3, [0, 2], [1.0, 3.0])
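To tie this to the regression use case: a minimal sketch of fitting a linear model on sparse vectors, assuming a local SparkSession and using the newer pyspark.ml API (pyspark.mllib is the older RDD-based one):

from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors
from pyspark.ml.regression import LinearRegression

spark = SparkSession.builder.master('local[*]').getOrCreate()

# toy data: (label, sparse feature vector of length 3)
rows = [
    (1.0, Vectors.sparse(3, [0, 2], [1.0, 3.0])),
    (0.0, Vectors.sparse(3, [1], [2.0])),
    (2.0, Vectors.sparse(3, [0, 1], [4.0, 5.0])),
]
train = spark.createDataFrame(rows, ['label', 'features'])

model = LinearRegression(featuresCol='features', labelCol='label').fit(train)
print(model.coefficients)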

I hope this helps. Please let me know if you still have any questions; I'd be happy to help.


You can do that like this:

import numpy as np
import pandas as pd
import scipy.sparse

N = 40000   # number of categories in column D
M = 1000000 # number of rows
df = pd.DataFrame(np.random.randint(0, N, size=(M, 4)), columns=list('ABCD'))

# Build the one-hot encoding of D directly as a sparse COO matrix, skipping
# get_dummies entirely: row i gets a single 1 in column df['D'][i].
v = df['D'].values
sp = scipy.sparse.coo_matrix((np.ones_like(v), (np.arange(len(v)), v)), shape=[len(v), N])

print(sp.shape)
# (1000000, 40000)
print(sp.getnnz())
# 1000000
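From there, scikit-learn's linear models accept SciPy sparse input directly, so the dense columns can be stacked next to the dummies and the model fit without ever densifying. A minimal sketch, with a placeholder target y just for illustration:

from sklearn.linear_model import Ridge

# stack A, B, C next to the sparse dummies; CSR is the format most
# sklearn estimators want for fitting
X = scipy.sparse.hstack([scipy.sparse.csr_matrix(df[['A', 'B', 'C']].values), sp]).tocsr()

y = np.random.rand(M)  # placeholder target
model = Ridge().fit(X, y)
print(model.coef_.shape)
# (40003,)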
jdehesa