I'm having a problem with scikit-learn's one-hot and ordinal encoders that I hope someone can explain to me.

I'm following along with a Towards Data Science article that uses a Kaggle data set to predict student academic performance. I'm running on a MacBook Pro M1 and Python 3.11 using a virtual environment.

The data set has four nominal independent variables ['gender', 'race/ethnicity', 'lunch', 'test preparation course'] and one ordinal independent variable ['parental level of education']. The goal is a regression model that predicts the average of the math, reading, and writing scores.

I'd like to set up a scikit-learn pipeline that uses its OneHotEncoder and OrdinalEncoder. I am not interested in using Pandas get_dummies. My goal is a pure scikit-learn pipeline.

I tried to approach the problem in steps. The first attempt uses the encoders without a column_transformer:

ohe = OneHotEncoder(handle_unknown='ignore', sparse_output=False).set_output(transform='pandas')
education_levels = [
    "some high school",
    "high school",
    "some college",
    "associate's degree",
    "bachelor's degree",
    "master's degree",
    "doctorate"]
oe  = OrdinalEncoder(categories=education_levels)
encoded_data = pd.DataFrame(ohe.fit_transform(df[categorical_attributes]))
df = df.join(encoded_data)
encoded_data = pd.DataFrame(ohe.fit_transform(df[ordinal_attributes]))
df = df.join(encoded_data)

The data frame df has both ordinal and categorical variables encoded perfectly. It's ready to feed to whatever regression algorithm I want. Note the .join steps that add the encoded data to the data frame.

The second attempt instantiates a column transformer:

column_transformer = make_column_transformer(
    (ohe, categorical_attributes),
    (oe, ordinal_attributes))
column_transformer.fit_transform(X)

The data frame X contains only the ordinal and categorical independent variables.

When I run this version I get a stack trace that ends with the following error:

ValueError: Shape mismatch: if categories is an array, it has to be of shape (n_features,).

Is the error telling me that I need to turn the categories list into an array and transpose it? Why does it work perfectly when I apply .fit_transform for each encoder sequentially without the pipeline?

The code I have appears to be what the article recommends.

I'm troubled by the column transformer. If it applies .fit_transform for each encoder sequentially it won't be doing those intermediate .join steps that my first solution has. Those are key to adding the new columns to the data frame.

I want to make the pipeline work. What am I missing, and how do I fix it?

Edit:

I logged into ChatGPT and asked it for comments on my code. Making this change got me almost there:

# Wrapping education_levels with [] did it.
oe  = OrdinalEncoder(categories=[education_levels])
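
A minimal sketch of why the extra brackets matter, assuming a made-up three-row frame (`toy`) and a shortened `levels` list rather than the real Kaggle data. As far as I can tell, `categories` is read as one entry per column, so a flat list of three strings for a single column trips the shape check, while a nested list supplies one inner list for that one column:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy data standing in for the 'parental level of education' column.
toy = pd.DataFrame(
    {"parental level of education": ["high school", "some college", "some high school"]}
)
levels = ["some high school", "high school", "some college"]

# Flat list: three entries are offered for one column, so fitting fails.
try:
    OrdinalEncoder(categories=levels).fit(toy)
    flat_error = None
except ValueError as exc:
    flat_error = str(exc)

# Nested list: exactly one inner list of ordered categories per column.
encoded = OrdinalEncoder(categories=[levels]).fit_transform(toy)

print(flat_error)
print(encoded.ravel())
```

Each value is encoded as its position in `levels`, which is what gives the ordinal column its intended order.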

I have just one question remaining. The Pandas DataFrame that I get out has all the right values, but the column names are missing. How do I add them? Why were they not generated for me, as they were in the solution that did not use the pipeline?

duffymo
  • I think [this question](https://stackoverflow.com/questions/59525929/valueerror-shape-mismatch-if-categories-is-an-array-it-has-to-be-of-shape-n) might be related. – Minh-Long Luu Mar 18 '23 at 02:42
  • Thank you. I saw it but didn't understand its significance. I'll go over it more carefully. – duffymo Mar 18 '23 at 10:47

2 Answers

The following quick fix worked for me:

np_education_levels = [np.array(education_levels, dtype=object)]

This is the same format you see in the fitted attribute oe.categories_ (a list holding one array per column) after you call fit().
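
A small sketch to check that claim, assuming a made-up frame (`toy`) and a shortened `levels` list in place of the real data set. It fits the encoder and inspects the fitted attribute:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy data; the real column comes from StudentsPerformance.csv.
toy = pd.DataFrame(
    {"parental level of education": ["high school", "some college", "some high school"]}
)
levels = ["some high school", "high school", "some college"]

oe = OrdinalEncoder(categories=[np.array(levels, dtype=object)]).fit(toy)

# The fitted attribute is a list with one array of categories per column.
fitted = oe.categories_
print(type(fitted), len(fitted))
print(fitted[0])
```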

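However, using list() instead of [] gave me the same error :( That makes sense in hindsight: list(x) builds a flat list from the elements of x, whereas [x] wraps x in an outer list with a single entry. Full code below: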

import pandas as pd
import numpy as np
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

X = pd.read_csv(r"datasets/StudentsPerformance.csv")

categorical_attributes = ['gender', 'race/ethnicity', 'lunch', 'test preparation course']
ordinal_attributes = ['parental level of education']

education_levels = [
    "some high school",
    "high school",
    "some college",
    "associate's degree",
    "bachelor's degree",
    "master's degree"]

# the modified line below
np_education_levels = [np.array(education_levels, dtype=object)]

ohe = OneHotEncoder(sparse_output=False)
oe = OrdinalEncoder(categories=np_education_levels)

column_transformer = make_column_transformer(
    (ohe, categorical_attributes),
    (oe, ordinal_attributes)).set_output(transform="pandas")

X_transformed = column_transformer.fit_transform(X)
JustSurf
  • Thank you. As I explain above, ChatGPT suggested wrapping education_level with []. Just one question remaining re: column names. They were generated with the non-pipeline solution. Why not for the pipeline? – duffymo Mar 18 '23 at 16:30
  • By column names, do you mean the original columns? In your non-pipeline solution, if you check the data returned by OneHotEncoder and OrdinalEncoder, it is just the transformed columns (without the original columns). Since you join your original dataframe with the transformed data, you have all columns in the end. Also, if you want the non-transformed columns in the end (math score, writing score, and reading score), you just have to add remainder='passthrough' as a parameter to your make_column_transformer :) – JustSurf Mar 18 '23 at 17:34
  • No, I want the column names generated by the encoders. – duffymo Mar 18 '23 at 21:05

Thanks to ChatGPT and all who responded here. I've got the complete solution below:

import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# https://towardsdatascience.com/guide-to-encoding-categorical-features-using-scikit-learn-for-machine-learning-5048997a5c79

if __name__ == '__main__':
    df = pd.read_csv('../../resources/StudentsPerformance.csv')
    print(df.info())
    print(df.describe())

    # five categorical independent variables, three numeric dependent variables
    # calculate correlation for math, reading, and writing scores
    numerical_attributes = ['math score', 'reading score', 'writing score']
    numerical_data = df[numerical_attributes]
    categorical_attributes = ['gender', 'race/ethnicity', 'lunch', 'test preparation course']
    categorical_data = df[categorical_attributes]
    ordinal_attributes = ['parental level of education']
    ordinal_data = df[ordinal_attributes]

    corr_matrix = numerical_data.corr()
    # https://stackoverflow.com/questions/31698861/add-column-to-the-end-of-pandas-dataframe-containing-average-of-previous-data
    df = df.assign(mean_score = df[numerical_attributes].mean(axis=1, numeric_only=True))

    # Begin exploratory data analysis
#    score_by_gender = numeric_data_with_mean_score.groupby('gender')['mean_score'].mean()
#    score_by_race   = numeric_data_with_mean_score.groupby('race/ethnicity')['mean_score'].mean()
#    score_by_edu    = numeric_data_with_mean_score.groupby('parental level of education')['mean_score'].mean()
#    score_by_lunch  = numeric_data_with_mean_score.groupby('lunch')['mean_score'].mean()
#    score_by_prep   = numeric_data_with_mean_score.groupby('test preparation course')['mean_score'].mean()
#
#    c = ['red', 'orange', 'yellow', 'green', 'blue']
#    for i in range(0, len(categorical_attributes)):
#        plt.bar(numeric_data_with_mean_score[categorical_attributes[i]], numeric_data_with_mean_score['mean_score'])
#        plt.show()
#
#    # https://www.machinelearningplus.com/plots/matplotlib-histogram-python-examples/
#    plt.hist(numeric_data_with_mean_score['mean_score'], bins=50, color='red')
#    # https://www.machinelearningplus.com/plots/matplotlib-histogram-python-examples/
#    sns.displot(numeric_data_with_mean_score['mean_score'], color='dodgerblue')
#    plt.show()
    # End exploratory data analysis

    # Begin data cleaning
    X = df.drop(columns=['math score', 'reading score', 'writing score', 'mean_score'])
    y = df['mean_score']
    ohe = OneHotEncoder(handle_unknown='ignore', sparse_output=False).set_output(transform='pandas')
    education_levels = [
        "some high school",
        "high school",
        "some college",
        "associate's degree",
        "bachelor's degree",
        "master's degree",
        "doctorate"]
    oe  = OrdinalEncoder(categories=[education_levels])
    # encoded_data = pd.DataFrame(ohe.fit_transform(df[categorical_attributes]))
    # df = df.join(encoded_data)
    # encoded_data = pd.DataFrame(ohe.fit_transform(df[ordinal_attributes]))
    # df = df.join(encoded_data)
    column_transformer = make_column_transformer(
        (ohe, categorical_attributes),
        (oe, ordinal_attributes))
    # End data cleaning

    # Model pipelines
    # linear and gradient boost regressions
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1234)
    lm = LinearRegression()
    lm_pipeline = make_pipeline(column_transformer, lm)
    lm_pipeline.fit(X_train, y_train)
    lm_predictions = lm_pipeline.predict(X_test)
    # scikit-learn metrics take (y_true, y_pred) in that order
    lm_mae = mean_absolute_error(y_test, lm_predictions)
    lm_rmse = np.sqrt(mean_squared_error(y_test, lm_predictions))

    gbm = GradientBoostingRegressor()
    gbm_pipeline = make_pipeline(column_transformer, gbm)
    gbm_pipeline.fit(X_train, y_train)
    gbm_predictions = gbm_pipeline.predict(X_test)
    gbm_mae = mean_absolute_error(y_test, gbm_predictions)
    gbm_rmse = np.sqrt(mean_squared_error(y_test, gbm_predictions))
duffymo