I am trying to pivot a dataframe with categorical features directly into a sparse matrix. My question is similar to this question, or this one, but my dataframe contains multiple categorical variables, so those approaches don't work.
This code currently works, but df.pivot()
works with a dense matrix and with my real dataset, I run out of RAM. Can you suggest a way to work with sparse matrices from the start?
import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.preprocessing import OneHotEncoder
# Example dataframe
data = {
'id':[13,13,14,14,14,15],
'name':['alex', 'mary', 'alex', 'barry', 'john', 'john'],
'categ': ['dog', 'cat', 'dog', 'ant', 'fox', 'seal'],
'size': ['big', 'small', 'big', 'tiny', 'medium', 'big']
}
df = pd.DataFrame(data)
# Pivoting -- problematic step creating a huge dense matrix
pivot = df.pivot(index='id', columns='name', values=['categ','size']).fillna('-')
cols = pivot.columns.to_flat_index().str.join('_')
pivot.columns = cols
# One hot encoding into a sparse matrix
encoder = OneHotEncoder(sparse_output=True)
pivot_enc = encoder.fit_transform(pivot)
pivot = pd.DataFrame.sparse.from_spmatrix(
pivot_enc, columns=encoder.get_feature_names_out())
print(pivot.dtypes)
Thanks for help!