
I need to transform an independent (feature) column from strings to a numerical representation. I am using OneHotEncoder for the transformation. My dataset has many independent columns, some of which look like this:

Country     |    Age       
--------------------------
Germany     |    23
Spain       |    25
Germany     |    24
Italy       |    30 

I have to encode the Country column like this:

0     |    1     |     2     |       3
--------------------------------------
1     |    0     |     0     |      23
0     |    1     |     0     |      25
1     |    0     |     0     |      24 
0     |    0     |     1     |      30

I managed to get the desired transformation using OneHotEncoder as follows:

#Encoding the categorical data
from sklearn.preprocessing import LabelEncoder

labelencoder_X = LabelEncoder()
X[:,0] = labelencoder_X.fit_transform(X[:,0])

#we are dummy encoding because the machine learning algorithms would otherwise
#be confused by ordinal-looking values like Spain > Germany > France
from sklearn.preprocessing import OneHotEncoder

onehotencoder = OneHotEncoder(categorical_features=[0])
X = onehotencoder.fit_transform(X).toarray()

Now I'm getting a deprecation warning telling me to use categories='auto'. If I do so, the transformation is applied to all independent columns (country, age, salary, etc.).

How can I apply the transformation to the 0th column of the dataset only?

– Hassaan

12 Answers


There are actually two warnings:

FutureWarning: The handling of integer data will change in version 0.22. Currently, the categories are determined based on the range [0, max(values)], while in the future they will be determined based on the unique values. If you want the future behaviour and silence this warning, you can specify "categories='auto'". In case you used a LabelEncoder before this OneHotEncoder to convert the categories to integers, then you can now use the OneHotEncoder directly.

and the second:

The 'categorical_features' keyword is deprecated in version 0.20 and will be removed in 0.22. You can use the ColumnTransformer instead.
"use the ColumnTransformer instead.", DeprecationWarning)

In the future, you should not select the columns inside the OneHotEncoder itself; if you want the future behaviour and want to silence the first warning, specify categories='auto'. The first message also tells you that you can now use OneHotEncoder directly, without a LabelEncoder first. Finally, the second message tells you to use ColumnTransformer, which is like a Pipeline for column transformations.

Here is the equivalent code for your case:

from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

# The last argument ([0]) is the list of columns you want to transform in this step
ct = ColumnTransformer([("Name_Of_Your_Step", OneHotEncoder(), [0])], remainder="passthrough")
ct.fit_transform(X)

See also: the ColumnTransformer documentation

For the above example:

Encoding the categorical data (basically changing text to numerical data, i.e. the Country name):

from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer
#Encode Country Column
labelencoder_X = LabelEncoder()
X[:,0] = labelencoder_X.fit_transform(X[:,0])
ct = ColumnTransformer([("Country", OneHotEncoder(), [0])], remainder = 'passthrough')
X = ct.fit_transform(X)
– CoMartel
  • I assigned X = ct.fit_transform(X) and it transformed the country column, but it removed the age column completely. How do I get both the transformed result and the age column data? – Hassaan Jan 25 '19 at 12:10
  • I made the correction; you have the `remainder` argument to determine what to do with unmodified columns – CoMartel Jan 25 '19 at 12:41
  • Okay, the only problem I'm facing right now is that ct.fit_transform(X) is returning a NumPy ndarray with dtype='object', which is not supported by the array editor. To overcome this I have converted the whole matrix to float. Is that the right way? – Hassaan Jan 25 '19 at 13:41
  • Just a question, because the documentation also didn't clear it up for me: what is the purpose of "Name"? – Shravya Boggarapu Jun 02 '19 at 11:13
  • `Name` is just the name of the step. You can name it as you want, and it can be useful to address this step later, for example if you just need to set/get the parameters of one step (see the sketch after these comments) – CoMartel Jun 03 '19 at 07:04
  • Use remainder='passthrough' as mentioned in the documentation, e.g. transformer = ColumnTransformer(transformers=[("Country", OneHotEncoder(), [0])], remainder='passthrough'), where "Country" is just a name and [0] is the column(s) the encoder is applied to – Swarit Agarwal Sep 04 '19 at 10:42
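As a rough illustration of the point about step names, here is a minimal sketch; the step name "country_ohe" and the toy data are made up for this example:

import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

# Toy data mirroring the question: Country in column 0, Age in column 1
X = np.array([["Germany", 23], ["Spain", 25], ["Germany", 24], ["Italy", 30]], dtype=object)

ct = ColumnTransformer([("country_ohe", OneHotEncoder(), [0])], remainder="passthrough")
ct.fit_transform(X)

# The step name lets you reach the fitted encoder afterwards ...
print(ct.named_transformers_["country_ohe"].categories_)

# ... or set one of its parameters through the ColumnTransformer
ct.set_params(country_ohe__handle_unknown="ignore")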

As of version 0.22, you can write the same code as below:

from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
ct = ColumnTransformer([("Country", OneHotEncoder(), [0])], remainder = 'passthrough')
X = ct.fit_transform(X)

As you can see, you don't need to use LabelEncoder anymore.

– Plabon Dutta
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

transformer = ColumnTransformer(
    transformers=[
        ("Country",        # Just a name for this step
         OneHotEncoder(),  # The transformer class
         [0]               # The column(s) it is applied to
         )
    ], remainder='passthrough'
)
X = transformer.fit_transform(X)

remainder='passthrough' will keep the untouched columns, while the 0th column will be replaced by its encoded form.
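For illustration, here is a minimal sketch (the data is made up to match the shape of the question's) of the difference between the default remainder='drop' and remainder='passthrough':

import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

# Made-up data shaped like the question's: Country in column 0, Age in column 1
X = np.array([["Germany", 23], ["Spain", 25], ["Germany", 24], ["Italy", 30]], dtype=object)

# remainder='drop' (the default) keeps only the transformed column(s)
dropped = ColumnTransformer([("Country", OneHotEncoder(), [0])]).fit_transform(X)
print(dropped.shape)  # (4, 3): just the three country dummy columns, Age is gone

# remainder='passthrough' appends the untouched Age column after the dummies
kept = ColumnTransformer([("Country", OneHotEncoder(), [0])], remainder="passthrough").fit_transform(X)
print(kept.shape)     # (4, 4): dummy columns plus the original Age column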

– Swarit Agarwal

Don't use the LabelEncoder; use OneHotEncoder directly:

from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import make_column_transformer
A = make_column_transformer(
    (OneHotEncoder(categories='auto'), [0]), 
    remainder="passthrough")

x=A.fit_transform(x)
– Naresh Kumar

There is also a way to do one-hot encoding with pandas:

import pandas as pd
ohe=pd.get_dummies(dataframe_name['column_name'])

Give names to the newly formed columns and add them to your dataframe, as in the sketch below. Check the pandas documentation for get_dummies.
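A minimal sketch (the dataframe below is made up to mirror the question's data) of naming the dummy columns and joining them back:

import pandas as pd

# Made-up dataframe mirroring the question's data
df = pd.DataFrame({"Country": ["Germany", "Spain", "Germany", "Italy"],
                   "Age": [23, 25, 24, 30]})

# prefix= gives the new columns readable names such as Country_Germany
ohe = pd.get_dummies(df["Country"], prefix="Country")

# Join the dummy columns back and drop the original text column
df = pd.concat([df.drop(columns="Country"), ohe], axis=1)
print(df)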

– Veera Srikanth
  • This is what I used, with one more parameter to avoid the dummy variable trap: drop_first=True – Ali Aug 21 '19 at 00:40

I had the same issue and the following worked for me:

OneHotEncoder(categories='auto', sparse=False)

Hope this helps
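For example, here is a minimal sketch of applying this encoder to the first column only; the slicing and np.hstack are my own additions, using toy data shaped like the question's:

import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy data shaped like the question's: Country in column 0, Age in column 1
X = np.array([["Germany", 23], ["Spain", 25], ["Germany", 24], ["Italy", 30]], dtype=object)

# sparse=False returns a dense array; in scikit-learn >= 1.2 the parameter is named sparse_output
enc = OneHotEncoder(categories='auto', sparse=False)
country_dummies = enc.fit_transform(X[:, [0]])  # [[0]] keeps the column 2-D, as the encoder expects
X_encoded = np.hstack([country_dummies, X[:, 1:]])
print(X_encoded)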

– Davide Fiocco

Use the following code:

import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

columnTransformer = ColumnTransformer([('encoder', OneHotEncoder(), [0])], remainder='passthrough')

X = np.array(columnTransformer.fit_transform(X), dtype=str)

print(X)
– ChrisMM
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
...
onehotencoder = ColumnTransformer(
   [('one_hot_encoder', OneHotEncoder(), [0])],
   remainder='passthrough'
)

X = onehotencoder.fit_transform(X)
# Data Preprocessing Template

# Importing the libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Importing the dataset
dataset = pd.read_csv('Data.csv')
X = dataset.iloc[:,:-1].values
y = dataset.iloc[:,3].values

# Taking care of missing data
#from sklearn.preprocessing import Imputer
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer = imputer.fit(X[:,1:3])
X[:,1:3] = imputer.transform(X[:,1:3])

#encoding Categorical Data
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

labelencoder_X = LabelEncoder()
X[:,0] = labelencoder_X.fit_transform(X[:,0])
onehotencoder = ColumnTransformer([("Country", OneHotEncoder(), [0])], remainder = "passthrough")
X = onehotencoder.fit_transform(X)


labelencoder_y = LabelEncoder()
y = labelencoder_y.fit_transform(y)
  • While this code may answer the question, providing additional context regarding why and/or how this code answers the question improves its long-term value. – dan1st Feb 05 '21 at 11:38
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

transformer = ColumnTransformer([('one_hot_encoder', OneHotEncoder(), [0])], remainder='passthrough')
x = np.array(transformer.fit_transform(x), dtype=float)

This replaces the deprecated line

onehotencoder = OneHotEncoder(categorical_features=[0])

and should solve the error.

– DaveL17

When updating the code from this:

one_hot_encoder = OneHotEncoder(categorical_features = [0, 1, 4, 5, 6])
X_train = one_hot_encoder.fit_transform(X_train).toarray()

To this:

ct = ColumnTransformer([('one_hot_encoder', OneHotEncoder(), [
                       0, 1, 4, 5, 6])], remainder='passthrough')
X_train = np.array(ct.fit_transform(X_train), dtype=np.float)

Note that I had to add dtype=np.float to fix the error message TypeError: can't convert np.ndarray of type numpy.object_.

Here my columns were [0, 1, 4, 5, 6], and 'one_hot_encoder' can be any name.

My imports were:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
import numpy as np
– Zanon

I had a similar challenge because the categorical_features attribute is deprecated. The sure way is to use ColumnTransformer. This is my code below:

import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer

companies = pd.read_csv(r'E:\SimpleLearn ML\1000_Companies.csv')
X = companies.iloc[:, :-1].values
y = companies.iloc[:, 4].values
companies.head()

labelencoder = LabelEncoder()
X[:, 3] = labelencoder.fit_transform(X[:,3])

onehotencoder = ColumnTransformer([("State", OneHotEncoder(), [3])], remainder = "passthrough")
X = onehotencoder.fit_transform(X)

labelencoder_y = LabelEncoder()
y = labelencoder_y.fit_transform(y)
– Happy N. Monday