How do I create a sparse matrix in CSR/COO format for a huge feature vector (50000 x 100000) from categorical data stored in a Pandas DataFrame? I am creating the feature vector with the Pandas get_dummies() function, but it raises a MemoryError. How do I avoid that and generate the result directly as a sparse matrix in CSR format?
Viewed 1,011 times
3
- Pandas has a sparse format, and an experimental way of generating a `scipy` sparse matrix (something like `to_coo()`). Most likely the memory error is the result of creating a large dense array as an intermediary. Do an SO search. – hpaulj Nov 09 '15 at 21:19
- What do you mean by an SO search? Also, `to_coo` seems to be a method of a SparseSeries object and not of a SparseDataFrame. How do I go about doing it for a SparseDataFrame? – ExtremistEnigma Nov 09 '15 at 23:10
3 Answers
0
Use:
scipy.sparse.coo_matrix(df_dummies)
but do not forget to create df_dummies sparse in the first place...
df_dummies = pandas.get_dummies(df, sparse=True)
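A minimal sketch of the same idea on more recent pandas versions, where `get_dummies(..., sparse=True)` returns a regular DataFrame with sparse columns. Passing such a frame straight to `coo_matrix` may go through a dense array first, so this sketch uses the `DataFrame.sparse.to_coo()` accessor instead. The toy frame and the `dtype=np.uint8` choice are assumptions for illustration, not part of the original answer.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame. Only categorical columns are included, because the
# .sparse accessor requires every column of the frame to be sparse.
df = pd.DataFrame({'category': ['c1', 'c2', 'c1', 'c3', 'c1']})

# dtype=np.uint8 keeps the fill value at 0, which to_coo() requires.
df_dummies = pd.get_dummies(df, sparse=True, dtype=np.uint8)

# Build a scipy COO matrix from the sparse columns, then convert to CSR.
coo = df_dummies.sparse.to_coo()
csr = coo.tocsr()

print(csr.shape, csr.nnz)
```

Numeric columns that `get_dummies` leaves untouched stay dense, so they would need to be dropped or converted before calling the accessor.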

ntg
0
This answer keeps the data as sparse as possible and avoids the memory issues of Pandas get_dummies.
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder
from scipy import sparse
df = pd.DataFrame({'rowid': [1, 2, 3, 4, 5], 'category': ['c1', 'c2', 'c1', 'c3', 'c1']})
print('Input data frame\n{0}'.format(df))
print('Encode column category as numerical variables')
# Map each category string to an integer code
labels = LabelEncoder().fit_transform(df.category)
print(labels)
print('Encode column category as dummy matrix')
# OneHotEncoder returns a scipy sparse matrix by default; todense() is only for display
dummy_matrix = OneHotEncoder().fit_transform(labels.reshape(-1, 1))
print(dummy_matrix.todense())
print('Concat with the original data frame as a matrix')
# .values replaces DataFrame.as_matrix(), which newer pandas versions no longer provide
df_as_sparse = sparse.csr_matrix(df.drop(labels=['category'], axis=1).values)
# Stack the numeric columns and the dummy columns side by side, staying sparse
sparse_combined = sparse.hstack((df_as_sparse, dummy_matrix), format='csr')
print(sparse_combined.todense())
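As a shorter variant, here is a sketch that assumes scikit-learn 0.20 or later, where `OneHotEncoder` accepts string columns directly and the `LabelEncoder` step can be skipped. The names `encoder`, `numeric_part` and `combined` are made up for this illustration.

```python
import pandas as pd
from scipy import sparse
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({'rowid': [1, 2, 3, 4, 5],
                   'category': ['c1', 'c2', 'c1', 'c3', 'c1']})

# Fit on a 2-D selection (double brackets); the result is a scipy sparse
# matrix, so no dense dummy array is ever materialised.
encoder = OneHotEncoder()
dummy_matrix = encoder.fit_transform(df[['category']])

# Wrap the remaining numeric column as sparse and stack it next to the dummies.
numeric_part = sparse.csr_matrix(df[['rowid']].to_numpy())
combined = sparse.hstack((numeric_part, dummy_matrix), format='csr')

print(combined.shape)
```

The fitted `encoder.categories_` attribute records which category each dummy column corresponds to.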

pettinato