Question
After encoding categorical columns as numbers and pivoting from LONG to WIDE into a sparse matrix, I am trying to retrieve the category labels for the column names. I need this information to interpret the model in a later step.
Solution
Below is my solution, which is really convoluted. Please let me know if you have a better way:
import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.preprocessing import OneHotEncoder
# Example dataframe
data = {
    'id': [13, 13, 14, 14, 14, 15],
    'name': ['alex', 'mary', 'alex', 'barry', 'john', 'john'],
    'categ': ['dog', 'cat', 'dog', 'ant', 'fox', 'seal'],
    'size': ['big', 'small', 'big', 'tiny', 'medium', 'big']
}
df = pd.DataFrame(data)
# Create dictionaries from original dataframe to save categories
# Part of the convoluted solution
dcts = []
df_cols = ['categ', 'size']
for col in df_cols:
    cats = df[col].astype('category')
    dct = dict(enumerate(cats.cat.categories))
    dcts.append(dct)
# Change into category codes, otherwise sparse matrix cannot be built
for col in ['categ', 'size']:
    df[col] = df[col].astype('category').cat.codes
# Group by into sparse columns
piv = df.groupby(['id', 'name'])[['categ', 'size']].first().astype('Sparse[int]')
# Unstack keeps sparse format
piv = piv.unstack(fill_value=0)
piv.columns = piv.columns.to_flat_index().str.join('_')
# Encoding gives poor column names
encoder = OneHotEncoder(sparse_output=True)
piv_enc = encoder.fit_transform(piv)
piv_fin = pd.DataFrame.sparse.from_spmatrix(
    piv_enc, columns=encoder.get_feature_names_out())
The column names look like this: 'categ_alex_-', 'categ_alex_2.0', 'categ_barry_-', 'categ_barry_0.0', while we need the original category labels, i.e. 'categ_alex_-', 'categ_alex_dog', 'categ_barry_-', 'categ_barry_ant'.
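To make the correspondence between the numeric suffix and the label explicit, here is a minimal sketch using only the 'categ' values from the example above. It assumes the suffix in the encoded column name is the pandas category code, which is what `.cat.codes` produces; the variable names are illustrative only.

```python
import pandas as pd

# Same 'categ' values as in the example dataframe
categ = pd.Series(['dog', 'cat', 'dog', 'ant', 'fox', 'seal']).astype('category')

# For string data, pandas sorts the categories, so the code is the
# position of the label in that sorted order
code_to_label = dict(enumerate(categ.cat.categories))
codes = categ.cat.codes.tolist()

print(code_to_label)  # {0: 'ant', 1: 'cat', 2: 'dog', 3: 'fox', 4: 'seal'}
print(codes)          # [2, 1, 2, 0, 3, 4]
```

So a suffix of 2.0 stands for 'dog' and 0.0 stands for 'ant', which matches the desired names above.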
Convoluted part I need advice on
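# Fixing column names: swap the numeric code suffix for its category label
piv_cols = list(piv_fin.columns)
for dct, df_col in zip(dcts, df_cols):
    print(df_col, dct)
    for i, piv_col in enumerate(piv_cols):
        # Split e.g. 'categ_alex_2.0' into 'categ_alex' and '2.0'
        prefix, _, code = piv_col.rpartition('_')
        if piv_col.startswith(df_col + '_') and code != '-':
            # The code may be rendered as a float ('2.0') and may have
            # more than one digit, so parse it instead of slicing characters
            piv_cols[i] = prefix + '_' + dct[int(float(code))]
piv_fin.columns = piv_cols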
I'm sure there's a better way. Perhaps OneHotEncoder can use the category labels directly? Thanks for your help!
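One possible direction, sketched here under assumptions rather than as a definitive answer: skip the integer codes and OneHotEncoder entirely, build the final column names from the labels while the data is still in long format, and assemble the sparse matrix directly with `csr_matrix`. The names `long`, `feature`, `row_ids` and `feature_names` are made up for this sketch. Note that it does not produce the 'categ_alex_-' style columns for missing combinations; a missing combination is simply a 0 in every column of that group.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

df = pd.DataFrame({
    'id': [13, 13, 14, 14, 14, 15],
    'name': ['alex', 'mary', 'alex', 'barry', 'john', 'john'],
    'categ': ['dog', 'cat', 'dog', 'ant', 'fox', 'seal'],
    'size': ['big', 'small', 'big', 'tiny', 'medium', 'big'],
})

# Long format: one row per (id, name, source column, label)
long = df.melt(id_vars=['id', 'name'], value_vars=['categ', 'size'],
               var_name='column', value_name='label')

# Build the final feature name from the labels themselves,
# e.g. 'categ_alex_dog', so no code-to-label lookup is needed later
long['feature'] = long['column'] + '_' + long['name'] + '_' + long['label']

# Integer positions for rows (ids) and columns (features)
row_codes, row_ids = pd.factorize(long['id'])
col_codes, feature_names = pd.factorize(long['feature'])

# One entry per observed (id, feature) pair; csr_matrix would sum
# duplicates, but (id, name) is unique in this example
mat = csr_matrix(
    (np.ones(len(long), dtype=np.int64), (row_codes, col_codes)),
    shape=(len(row_ids), len(feature_names)),
)

wide = pd.DataFrame.sparse.from_spmatrix(
    mat, index=row_ids, columns=feature_names)
print(list(wide.columns))
```

The matrix stays sparse throughout, and `feature_names` can be passed along to whatever step interprets the model.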