18

This is the dataset, with 3 columns and 3 rows:

Name    Organization  Department
Manie   ABC2          FINANCE
Joyce   ABC1          HR
Ami     NSV2          HR

This is the code I have. It works fine up to this point, but how do I drop the first dummy variable column for each encoded variable?

# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset
dataset = pd.read_csv('Data1.csv',encoding = "cp1252")
X = dataset.values


# Encoding categorical data
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder_X_0 = LabelEncoder()
X[:, 0] = labelencoder_X_0.fit_transform(X[:, 0])
labelencoder_X_1 = LabelEncoder()
X[:, 1] = labelencoder_X_1.fit_transform(X[:, 1])
labelencoder_X_2 = LabelEncoder()
X[:, 2] = labelencoder_X_2.fit_transform(X[:, 2])

onehotencoder = OneHotEncoder(categorical_features = "all")
X = onehotencoder.fit_transform(X).toarray()
Max Power
Vijay
  • 2
    pandas has `get_dummies()`, which has a parameter `drop_first` you can set to True. Here's an example of using get_dummies: https://stackoverflow.com/a/43971156/1870832 – Max Power Jun 17 '17 at 06:41
  • Hey Max Power, I tried X = pd.get_dummies(X, drop_first=True)), but its showing an error SyntaxError: invalid syntax – Vijay Jun 17 '17 at 06:54
  • 1
    see my answer below and tested output. I'm guessing your syntax error is from another part of your code. – Max Power Jun 17 '17 at 07:03
  • Max, I tried your code and it works, but when I replace df with X, it throws an error. This is probably because X is not a DataFrame: I imported the CSV into dataset and then took X = dataset.iloc[:, :].values. I've done this because this is part of a much larger project that I have simplified for Stack Overflow. I will need to split the dataset into X and y – Vijay Jun 17 '17 at 07:14
  • 1
    your data comes in from `read_csv` as a pandas DataFrame (`dataset`). Try passing that `dataset` to `pd.get_dummies()` before taking `.values`. If you want the one-hot-encoded output to be a numpy array, you can take `.values` of the output of `pd.get_dummies` – Max Power Jun 17 '17 at 07:16
  • Thanks a lot Max Power. You're amazing! – Vijay Jun 17 '17 at 07:18
  • happy to help. good luck with the rest of the project. – Max Power Jun 17 '17 at 07:20
  • Please let me know how to select particular columns by indexes. I want to drop the first column and am trying x = pd.get_dummies(X, columns =[1:],drop_first=True), but its not working. – Vijay Jun 17 '17 at 07:38
  • 1
    see my updated answer. Sorry for the delay, I think we're in different timezones, I went to sleep. – Max Power Jun 17 '17 at 14:30
  • Yes we are in different timezones :-)... Thanks for ur reply.. Wish u a pleasant weekend. – Vijay Jun 17 '17 at 15:14
  • See my answer below. You can use OneHotEncoder starting sklearn version 0.21. – Jyoti Prasad Pal Oct 15 '19 at 03:49

7 Answers

24
import pandas as pd
df = pd.DataFrame({'name': ['Manie', 'Joyce', 'Ami'],
                   'Org':  ['ABC2', 'ABC1', 'NSV2'],
                   'Dept': ['Finance', 'HR', 'HR']        
        })


df_2 = pd.get_dummies(df,drop_first=True)

test:

print(df_2)
   Dept_HR  Org_ABC2  Org_NSV2  name_Joyce  name_Manie
0        0         1         0           0           1
1        1         0         0           1           0
2        1         0         1           0           0 

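UPDATE regarding your error with pd.get_dummies(X, columns=[1:], drop_first=True): a bare [1:] is not valid Python syntax, because a slice has to be applied to something.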

Per the documentation page, the columns parameter takes "Column Names". So the following code would work:

df_2 = pd.get_dummies(df, columns=['Org', 'Dept'], drop_first=True)

output:

    name  Org_ABC2  Org_NSV2  Dept_HR
0  Manie         1         0        0
1  Joyce         0         0        1
2    Ami         0         1        1

If you really want to define your columns positionally, you could do it this way:

column_names_for_onehot = df.columns[1:]
df_2 = pd.get_dummies(df, columns=column_names_for_onehot, drop_first=True)
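
Following up on the comment thread: if the rest of the project needs a NumPy array, one option is to encode the DataFrame first and only then convert it. The sketch below rebuilds the question's three rows inline instead of reading Data1.csv (that stand-in DataFrame is an assumption), and passes dtype=int so the output is 0/1 rather than the booleans newer pandas versions return by default.

```python
import pandas as pd

# Stand-in for the CSV from the question (same three rows).
dataset = pd.DataFrame({
    "Name": ["Manie", "Joyce", "Ami"],
    "Organization": ["ABC2", "ABC1", "NSV2"],
    "Department": ["FINANCE", "HR", "HR"],
})

# Encode while the data is still a DataFrame; drop_first removes the
# first (alphabetically smallest) category of every encoded column.
encoded = pd.get_dummies(dataset, drop_first=True, dtype=int)

# Only now convert to a NumPy array for the rest of the pipeline.
X = encoded.to_numpy()

print(encoded.columns.tolist())
print(X)
```

Keeping the encoded DataFrame around also keeps the column names, which makes it easier to split off a target column by name before calling to_numpy().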
Max Power
5

I use my own template for doing that:

from sklearn.base import TransformerMixin
import pandas as pd
import numpy as np
class DataFrameEncoder(TransformerMixin):

    def __init__(self):
        """Encode the data.

        Columns of data type object are appended in the list. After 
        appending Each Column of type object are taken dummies and 
        successively removed and two Dataframes are concated again.

        """
    def fit(self, X, y=None):
        self.object_col = []
        for col in X.columns:
            if(X[col].dtype == np.dtype('O')):
                self.object_col.append(col)
        return self

    def transform(self, X, y=None):
        dummy_df = pd.get_dummies(X[self.object_col],drop_first=True)
        X = X.drop(self.object_col,axis=1)
        X = pd.concat([dummy_df,X],axis=1)
        return X

To use this code, save the template in the current directory as, say, customEncoder.py, and type in your code:

from customEncoder import DataFrameEncoder
data = DataFrameEncoder().fit_transform(data)

All object-type columns are then dummy-encoded with the first category dropped, removed from the original data, and joined back to the remaining columns to give the final desired output.
PS: The input to this template must be a pandas DataFrame.

MD Rijwan
3

This is quite simple starting from scikit-learn version 0.21: use the drop parameter of OneHotEncoder to drop one of the categories per feature. By default (drop=None), nothing is dropped. Details can be found in the documentation.

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html#sklearn.preprocessing.OneHotEncoder

from sklearn.preprocessing import OneHotEncoder

# drops the first category in each feature
ohe = OneHotEncoder(drop='first', handle_unknown='error')
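
As a rough end-to-end sketch, here is that encoder applied to the three rows from the question. The inline DataFrame is an assumption standing in for the CSV; the result is converted with .toarray() because the encoder returns a sparse matrix by default.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Stand-in for the question's data (same three rows).
df = pd.DataFrame({
    "Name": ["Manie", "Joyce", "Ami"],
    "Organization": ["ABC2", "ABC1", "NSV2"],
    "Department": ["FINANCE", "HR", "HR"],
})

# drop='first' removes the first (sorted) category of every feature.
ohe = OneHotEncoder(drop="first", handle_unknown="error")
X_encoded = ohe.fit_transform(df).toarray()

# categories_ lists the sorted categories found per input column;
# the first entry of each list is the one that was dropped.
print(ohe.categories_)
print(X_encoded)
```

Unlike pd.get_dummies, the fitted encoder can be reused with ohe.transform(...) on new data, so train and test sets end up with the same columns.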
Jyoti Prasad Pal
1

I use my own module for dealing with one hot encoding.

from sklearn.preprocessing import OneHotEncoder
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class My_encoder(BaseEstimator, TransformerMixin):
   
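    # Note: scikit-learn 1.2 renamed OneHotEncoder's `sparse` argument to
    # `sparse_output` (and 1.4 removed `sparse`); on newer versions pass
    # sparse_output = sparse to OneHotEncoder instead.
    def __init__(self,drop = 'first',sparse=False):
        self.encoder = OneHotEncoder(drop = drop,sparse = sparse)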
        self.features_to_encode = []
        self.columns = []
    
    def fit(self,X_train,features_to_encode):
        
        data = X_train.copy()
        self.features_to_encode = features_to_encode
        data_to_encode = data[self.features_to_encode]
        self.columns = pd.get_dummies(data_to_encode,drop_first = True).columns
        self.encoder.fit(data_to_encode)
        return self
    
    def transform(self,X_test):
        
        data = X_test.copy()
        data.reset_index(drop = True,inplace =True)
        data_to_encode = data[self.features_to_encode]
        data_left = data.drop(self.features_to_encode,axis = 1)
        
        data_encoded = pd.DataFrame(self.encoder.transform(data_to_encode),columns = self.columns)
        
        return pd.concat([data_left,data_encoded],axis = 1)

It's pretty easy to use:

features_to_encode = [---list of features to one hot encode--]
enc = My_encoder()
enc.fit(X_train,features_to_encode)
X_train = enc.transform(X_train)
X_test = enc.transform(X_test)

It returns a DataFrame with column names, so it covers the disadvantages of both OneHotEncoder and pd.get_dummies(): we can fit on one dataset and transform another, as with OneHotEncoder, while keeping the column names and getting a DataFrame back, as with the dummies approach.

Hardik Kamboj
0

Encode the categorical variables one at a time. The dummy variables should go to the beginning index of your data set. Then, just cut off the first column like this:

X = X[:, 1:]

Then encode and repeat the next variable.
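
A minimal NumPy-only sketch of that idea, assuming the three rows from the question: each column is one-hot encoded on its own, its first dummy column is cut off, and the pieces are stacked side by side. This is an illustration of the "encode, then drop the first column" pattern rather than the exact scikit-learn workflow described above.

```python
import numpy as np

# Stand-in for the question's data as a plain array of strings.
X = np.array([
    ["Manie", "ABC2", "FINANCE"],
    ["Joyce", "ABC1", "HR"],
    ["Ami", "NSV2", "HR"],
])

blocks = []
for j in range(X.shape[1]):
    # Integer code per row, based on the sorted unique values of column j.
    _, codes = np.unique(X[:, j], return_inverse=True)
    # Full one-hot block for this column: one dummy per category.
    one_hot = np.eye(codes.max() + 1, dtype=int)[codes]
    # Cut off the first dummy column, as in X = X[:, 1:].
    blocks.append(one_hot[:, 1:])

X_encoded = np.hstack(blocks)
print(X_encoded)
```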

Robert
0

Applying OneHotEncoder only to certain columns is possible with ColumnTransformer. Create separate pipelines for the categorical and numerical variables and combine them with a ColumnTransformer. More information can be found in the scikit-learn documentation for ColumnTransformer.

Another great example of implementation of this is provided here.
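
A small sketch of what this could look like, combining ColumnTransformer with OneHotEncoder(drop='first'). The Salary column is a hypothetical numeric column added only to show the passthrough, and sparse_threshold=0 is set so the result is always a dense array.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({
    "Organization": ["ABC2", "ABC1", "NSV2"],
    "Department": ["FINANCE", "HR", "HR"],
    "Salary": [50000, 42000, 61000],  # hypothetical numeric column
})

ct = ColumnTransformer(
    transformers=[
        # One-hot encode only the categorical columns, dropping the
        # first category of each.
        ("onehot", OneHotEncoder(drop="first"), ["Organization", "Department"]),
    ],
    remainder="passthrough",  # leave the remaining columns untouched
    sparse_threshold=0,       # always return a dense array
)

# Encoded columns come first, passthrough columns are appended after them.
X = ct.fit_transform(df)
print(X)
```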

-2
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
for i in range(Y.shape[1]):
    Y[:,i] = le.fit_transform(Y[:,i])