Pivot dataframe with multiple categorical variables into sparse matrix

Question

I am trying to pivot a dataframe with categorical features directly into a sparse matrix. My question is similar to this question, or this one, but my dataframe contains multiple categorical variables, so those approaches don't work.

This code currently works, but df.pivot() works with a dense matrix and with my real dataset, I run out of RAM. Can you suggest a way to work with sparse matrices from the start?

import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.preprocessing import OneHotEncoder

# Example dataframe
data = {
  'id':[13,13,14,14,14,15],
  'name':['alex', 'mary', 'alex', 'barry', 'john', 'john'],
  'categ': ['dog', 'cat', 'dog', 'ant', 'fox', 'seal'],
  'size': ['big', 'small', 'big', 'tiny', 'medium', 'big']

}
df = pd.DataFrame(data)

# Pivoting -- problematic step creating a huge dense matrix
pivot = df.pivot(index='id', columns='name', values=['categ','size']).fillna('-')
cols = pivot.columns.to_flat_index().str.join('_')
pivot.columns = cols

# One hot encoding into a sparse matrix
encoder = OneHotEncoder(sparse_output=True)
pivot_enc = encoder.fit_transform(pivot)
pivot = pd.DataFrame.sparse.from_spmatrix(
    pivot_enc, columns=encoder.get_feature_names_out())

print(pivot.dtypes)

Thanks for help!

Dudelstein · Accepted Answer · 2023-04-20T14:24:04.763

The trick is in using a groupby(), which can work with Sparse types directly, but only if categorical variables are first encoded using .cat.codes.

Here is the answer which uses sparse matrices throughout:

for col in ['categ', 'size']:
    df[col] = df[col].astype('category').cat.codes

# Group by into sparse columns
piv = df.groupby(['id', 'name'])[['categ', 'size']].first().astype('Sparse')

# Unstack keeps sparse format
piv = piv.unstack(fill_value=0)

piv.columns = piv.columns.to_flat_index().str.join('_')

# Encoding works as previously
encoder = OneHotEncoder(sparse_output=True)
piv_enc = encoder.fit_transform(piv)
piv_fin = pd.DataFrame.sparse.from_spmatrix(
    piv_enc, columns=encoder.get_feature_names_out())

The way to keep the original category labels can be found in this post.

Pivot dataframe with multiple categorical variables into sparse matrix

1 Answers1