
I am trying to use sklearn's OneHotEncoder on the Kaggle Fish Market dataset. The dataset has several columns but only one categorical column, 'Species'. It has seven categories, so each sample should be encoded as a 7-element binary vector. Both encoded_df and the training data look fine on their own. However, when I concatenate them, the resulting DataFrame contains several NaN values. Can anyone please explain why?

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Load the dataset
fish_df = pd.read_csv('Fish.csv')

# Splitting the data
X = fish_df.drop('Weight', axis=1)
y = fish_df['Weight']
X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2, random_state=1)

# Define the columns to be encoded
categorical_cols = ['Species']

# Initialize the OneHotEncoder
encoder = OneHotEncoder(sparse=False)

# Fit and transform the categorical columns
encoded_df = pd.DataFrame(encoder.fit_transform(X_train[categorical_cols]))

# Concatenate the encoded features with the original features
X_train_trial = pd.concat([X_train.drop(categorical_cols, axis=1), encoded_df], axis=1)

X_train_trial

I concatenated the two DataFrames along axis=1, and the resulting DataFrame contains NaN values.

Frank
  • 1
