I'm having a problem with scikit-learn's one-hot and ordinal encoders that I hope someone can explain to me.
I'm following along with a Towards Data Science article that uses a Kaggle data set to predict student academic performance. I'm running Python 3.11 in a virtual environment on an M1 MacBook Pro.
The data set has four nominal independent variables, ['gender', 'race/ethnicity', 'lunch', 'test preparation course'], and one ordinal independent variable, ['parental level of education']. The goal is a regression model that predicts the average of the math, reading, and writing scores.
I'd like to set up a scikit-learn pipeline that uses OneHotEncoder and OrdinalEncoder. I am not interested in using the pandas get_dummies function. My goal is a pure scikit-learn pipeline.
I approached the problem in steps. The first attempt uses the encoders without a column_transformer:
ohe = OneHotEncoder(handle_unknown='ignore', sparse_output=False).set_output(transform='pandas')
education_levels = [
    "some high school",
    "high school",
    "some college",
    "associate's degree",
    "bachelor's degree",
    "master's degree",
    "doctorate",
]
oe = OrdinalEncoder(categories=education_levels)
encoded_data = pd.DataFrame(ohe.fit_transform(df[categorical_attributes]))
df = df.join(encoded_data)
encoded_data = pd.DataFrame(ohe.fit_transform(df[ordinal_attributes]))
df = df.join(encoded_data)
The data frame df has both ordinal and categorical variables encoded perfectly. It's ready to feed to whatever regression algorithm I want. Note the .join steps that add the encoded data to the data frame.
The second attempt instantiates a column transformer:
column_transformer = make_column_transformer(
    (ohe, categorical_attributes),
    (oe, ordinal_attributes),
)
column_transformer.fit_transform(X)
The data frame X contains only the ordinal and categorical independent variables.
When I run this version I get a stack trace that ends with the following error:
ValueError: Shape mismatch: if categories is an array, it has to be of shape (n_features,).
Is the error telling me that I need to turn the categories list into an array and transpose it? Why does it work perfectly when I apply .fit_transform for each encoder sequentially without the pipeline?
The code I have appears to be what the article recommends.
I'm troubled by the column transformer. If it applies .fit_transform for each encoder sequentially, it won't be doing those intermediate .join steps that my first solution has. Those are key to adding the new columns to the data frame.
I want to make the pipeline work. What am I missing, and how do I fix it?
Edit:
I logged into ChatGPT and asked it for comments on my code. Making this change got me almost there:
# Wrapping education_levels with [] did it.
oe = OrdinalEncoder(categories=[education_levels])
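As far as I can tell, the reason this works is that categories expects one list of levels per encoded column, so a single ordinal column needs a list containing one inner list. A minimal sketch on a made-up frame (the rows in toy are hypothetical, not taken from the Kaggle data):

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

education_levels = [
    "some high school",
    "high school",
    "some college",
    "associate's degree",
    "bachelor's degree",
    "master's degree",
    "doctorate",
]

# Hypothetical rows with a single ordinal column.
toy = pd.DataFrame({
    "parental level of education": ["high school", "doctorate", "some high school"],
})

# One inner list per column being encoded; here there is one column.
oe = OrdinalEncoder(categories=[education_levels])
codes = oe.fit_transform(toy)

# Each value is the position of the level in education_levels.
print(codes.ravel())
```

With two ordinal columns, the outer list would hold two inner lists, in the same order as the columns passed to the encoder.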
I have just one question remaining. The Pandas DataFrame that I get out has all the right values, but the column names are missing. How do I add them? Why were they not generated for me, as they were in the solution that did not use the pipeline?