
I'm building a one-hot encoding function for a pandas DataFrame and can't figure out how to get the encoded data back into the DataFrame. I get:

"IndexError: only integers, slices (:), ellipsis (...), numpy.newaxis (None) and integer or boolean arrays are valid indices"

How do I reintegrate this back into the pandas DataFrame?

def one_hot_encoder(features, df_to_encode):
    """One-hot encode the given columns.

    Parameters:
    features (list): columns to encode
    df_to_encode (pandas DataFrame): dataframe to encode

    Returns:
    dataframe: encoded dataframe
    """
    from sklearn.preprocessing import OneHotEncoder
    for column in features:
        # one-hot encode this column
        enc = OneHotEncoder(sparse=False)
        column_norm = column + "_encoded"
        df = enc.fit_transform(df_to_encode[[column]])

    return df

columns_to_one_hot_encode = ["type"]
df = one_hot_encoder(columns_to_one_hot_encode,df)

The data I'm using is from https://www.kaggle.com/ealaxi/paysim1

resolver101
2 Answers


You don't need sklearn; you can simply use `pandas.get_dummies`:

import pandas as pd

def one_hot_encoder(features, df_to_encode):
    """One-hot encode the given columns.

    Parameters:
    features (list): columns to encode
    df_to_encode (pandas DataFrame): dataframe to encode

    Returns:
    dataframe: encoded dataframe
    """
    return pd.get_dummies(df_to_encode, columns=features)

columns_to_one_hot_encode = ["type"]
df = one_hot_encoder(columns_to_one_hot_encode, df)
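For illustration, here is what `pd.get_dummies` produces on a tiny made-up frame with a `type` column (the column name comes from the question; the values and the `amount` column are invented for the sketch):

```python
import pandas as pd

# Hypothetical minimal frame mimicking the PaySim "type" column
df = pd.DataFrame({"type": ["CASH_OUT", "PAYMENT", "CASH_OUT"],
                   "amount": [100.0, 50.0, 75.0]})

# The "type" column is replaced in place by one indicator
# column per category; other columns are left untouched.
encoded = pd.get_dummies(df, columns=["type"])
print(encoded.columns.tolist())
# ['amount', 'type_CASH_OUT', 'type_PAYMENT']
```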
Rodalm
    What's the difference between `get_dummies` and scikit-learn's `OneHotEncoder`? – resolver101 Nov 11 '21 at 20:40
  • @resolver101 What do you mean? They are two different implementations, in two different libraries, of the same concept. But if you are using `pandas` anyway, it's easier to use `pandas.get_dummies`. As you can see, this solution is much simpler than the other. I don't know what made you change your mind and accept the other. – Rodalm Nov 11 '21 at 20:47
  • @HarrPlotter You are right that for a simple transformation you can easily use `get_dummies()`. However, as stated in this post as well, `OneHotEncoder()` has some advantages over `get_dummies`. The main one, in my opinion, is that you can easily apply the fitted `OneHotEncoder` object to your test data, which then uses the same categories and columns. Given that the dataset is for a Kaggle competition, I think `OneHotEncoder()` is the wiser choice. https://stackoverflow.com/questions/36631163/what-are-the-pros-and-cons-between-get-dummies-pandas-and-onehotencoder-sciki – JonnDough Nov 13 '21 at 07:52
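To illustrate the point raised in the last comment, a `OneHotEncoder` fitted on training data can be reused on test data so both splits get identical columns, even when the test split is missing some categories. This is a sketch with made-up data; the `toarray()` call is used so it works whether or not the encoder returns a sparse matrix:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical train/test split: the test frame is missing
# two of the three categories seen in training.
train = pd.DataFrame({"type": ["CASH_OUT", "PAYMENT", "TRANSFER"]})
test = pd.DataFrame({"type": ["PAYMENT", "PAYMENT"]})

enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(train[["type"]])

# transform() reuses the categories learned from train, so the test
# matrix has the same three columns even though test contains only one.
test_enc = enc.transform(test[["type"]]).toarray()
print(test_enc.shape)  # (2, 3)
```

By contrast, calling `pd.get_dummies` separately on the test frame would produce only one column, and the train/test column sets would no longer line up.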

You can use the get_feature_names method built into scikit-learn's OneHotEncoder and then drop the old column. That way you can still use OneHotEncoder instead of pd.get_dummies:

import pandas as pd

def one_hot_encoder(features, df_to_encode):
    """One-hot encode the given columns.

    Parameters:
    features (list): columns to encode
    df_to_encode (pandas DataFrame): dataframe to encode

    Returns:
    dataframe: encoded dataframe
    """
    from sklearn.preprocessing import OneHotEncoder
    for column in features:
        enc = OneHotEncoder(sparse=False)
        # keep the original index so concat aligns the rows correctly
        df_enc = pd.DataFrame(enc.fit_transform(df_to_encode[[column]]),
                              index=df_to_encode.index)
        df_enc.columns = enc.get_feature_names([column])
        # drop the original column and append the encoded ones,
        # feeding the result back in so every column gets handled
        df_to_encode = df_to_encode.drop(column, axis=1)
        df_to_encode = pd.concat([df_to_encode, df_enc], axis=1)

    return df_to_encode


columns_to_one_hot_encode = ["type"]
df = one_hot_encoder(columns_to_one_hot_encode, df)
JonnDough
  • I've used get_feature_names, but it looks like it's going to be deprecated. It says "get_feature_names is deprecated in 1.0 and will be removed in 1.2." – resolver101 Nov 11 '21 at 20:42
  • 1
    Good catch! I didn't get that warning, but after reading SciKit's documentation, I think you can change this to: get_feature_names_out(). – JonnDough Nov 13 '21 at 07:48