How do I create a sparse matrix in CSR/COO format for a huge feature matrix (50000 x 100000) from categorical data stored in a pandas DataFrame? I am building the features with the pandas get_dummies() function, but it raises a MemoryError. How can I avoid that and generate the result directly as a sparse matrix in CSR format?

ExtremistEnigma
  • Pandas has a sparse format, and an experimental way of generating a `scipy` sparse matrix (something like `tocoo()`). Most likely the memory error is the result of creating a large dense array as an intermediary. Do an SO search. – hpaulj Nov 09 '15 at 21:19
  • What do you mean by an SO search? Also, to_coo seems to be a method of a SparseSeries object and not of a SparseDataFrame. How do I go about doing it for a SparseDataFrame? – ExtremistEnigma Nov 09 '15 at 23:10

3 Answers

Use:

scipy.sparse.coo_matrix(df_dummies)

but do not forget to create df_dummies as a sparse frame in the first place:

df_dummies = pandas.get_dummies(df, sparse=True)
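
On recent pandas versions, a frame whose columns are all sparse also exposes a `.sparse` accessor. This may be a safer route than the constructor call, which, depending on the pandas and SciPy versions, can build a dense intermediate. A minimal sketch, assuming a pandas version that provides `DataFrame.sparse.to_coo()`; the toy frame and the `uint8` dtype are illustrative choices, not part of the original question:

```python
import pandas as pd

# Hypothetical toy data; the real frame would have far more rows and categories.
df = pd.DataFrame({"category": ["c1", "c2", "c1", "c3", "c1"]})

# sparse=True stores every dummy column as a sparse array instead of a dense one.
# Only categorical columns are included, so all resulting columns are sparse.
df_dummies = pd.get_dummies(df, columns=["category"], sparse=True, dtype="uint8")

# The .sparse accessor assembles a SciPy COO matrix from the stored non-zeros.
coo = df_dummies.sparse.to_coo()

# Convert to CSR for fast row slicing and arithmetic.
csr = coo.tocsr()
print(csr.shape, csr.nnz)
```

The accessor is only available when every column is sparse, so numeric columns left dense by get_dummies would need to be converted or stacked on separately.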
ntg

This approach keeps the categorical data sparse throughout and avoids the memory issues of pandas get_dummies.

import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder
from scipy import sparse

df = pd.DataFrame({'rowid':[1,2,3,4,5], 'category':['c1', 'c2', 'c1', 'c3', 'c1']})

print('Input data frame\n{0}'.format(df))

# LabelEncoder maps each distinct string label to an integer code
print('Encode column category as numerical variables')
print(LabelEncoder().fit_transform(df.category))

# OneHotEncoder returns a scipy sparse matrix; todense() is only for display
print('Encode column category as dummy matrix')
print(OneHotEncoder().fit_transform(LabelEncoder().fit_transform(df.category).reshape(-1,1)).todense())

print('Concat with the original data frame as a matrix')
dummy_matrix = OneHotEncoder().fit_transform(LabelEncoder().fit_transform(df.category).reshape(-1,1))
# to_numpy() replaces as_matrix(), which was removed from pandas
df_as_sparse = sparse.csr_matrix(df.drop(labels=['category'], axis=1).to_numpy())
# hstack joins the two sparse blocks column-wise without densifying them
sparse_combined = sparse.hstack((df_as_sparse, dummy_matrix), format='csr')
print(sparse_combined.todense())
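
If scikit-learn is not available, the same one-hot block can be assembled from integer codes with SciPy alone. The following is a sketch under stated assumptions rather than part of the approach above: it uses `pd.factorize` for the codes and assumes the column has no missing values (which `factorize` would encode as -1).

```python
import numpy as np
import pandas as pd
from scipy import sparse

df = pd.DataFrame({"rowid": [1, 2, 3, 4, 5],
                   "category": ["c1", "c2", "c1", "c3", "c1"]})

# factorize maps each distinct label to an integer code,
# numbered in order of first appearance.
codes, labels = pd.factorize(df["category"])

# One non-zero per row: row i gets a 1 in the column given by its code.
rows = np.arange(len(df))
data = np.ones(len(df), dtype=np.uint8)
one_hot = sparse.csr_matrix((data, (rows, codes)),
                            shape=(len(df), len(labels)))
print(one_hot.shape, one_hot.nnz)
```

Memory use scales with the number of rows rather than rows times categories, since only one entry per row is stored.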
pettinato