
What is the difference between pd.get_dummies and sklearn's OneHotEncoder in Python? As far as I know, both do the same job. Can anyone tell me what the main difference between them is, and which one is more efficient at present?

Nandini Matam
  • Thanks for your prompt reply, but which of the two is more efficient? The only difference I can identify from that post is that pandas get_dummies can directly convert string columns into integer columns, whereas with OneHotEncoder we have to explicitly define our mapping before it converts them. Is there any other difference? – Nandini Matam Mar 11 '19 at 10:37

2 Answers


1. Output difference

pd.get_dummies returns a pandas DataFrame, whereas OneHotEncoder returns a SciPy CSR sparse matrix by default.

Example -

import pandas as pd

s = pd.Series([1, 2, 3, 4, 5])
s
0    1
1    2
2    3
3    4
4    5
dtype: int64

type(pd.get_dummies(s))
pandas.core.frame.DataFrame

from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder()
type(enc.fit_transform(s.values.reshape(-1, 1)))  # call .toarray() on the result to get a NumPy ndarray
scipy.sparse.csr.csr_matrix
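
As a self-contained check of the output types described above, here is a minimal sketch. It assumes pandas, SciPy and scikit-learn are installed and that OneHotEncoder is used with its default settings; the variable names dummies, encoded and dense are only illustrative.

```python
import pandas as pd
from scipy import sparse
from sklearn.preprocessing import OneHotEncoder

s = pd.Series([1, 2, 3, 4, 5])

# pd.get_dummies returns a pandas DataFrame with one column per category.
dummies = pd.get_dummies(s)

# OneHotEncoder expects a 2-D input, hence the reshape to a single column.
# With default settings the result is a SciPy sparse matrix.
enc = OneHotEncoder()
encoded = enc.fit_transform(s.values.reshape(-1, 1))

print(type(dummies).__name__)    # DataFrame
print(sparse.issparse(encoded))  # True

# .toarray() converts the sparse result into a dense NumPy ndarray.
dense = encoded.toarray()
print(dense.shape)               # (5, 5)
```

Using sparse.issparse instead of comparing against a specific class path keeps the check independent of where a given SciPy version defines csr_matrix.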

2. Time complexity

In the benchmark below, pd.get_dummies is roughly two to three times faster than OneHotEncoder.

Example -
s = pd.Series([1, 2, 3, 4, 5]*50000)
len(s)
250000

%timeit pd.get_dummies(s)
15.2 ms ± 227 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit enc.fit_transform(s.values.reshape(-1, 1))
34.1 ms ± 5.26 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit enc.fit_transform(s.values.reshape(-1, 1)).toarray() # more reusable
45.3 ms ± 5.63 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
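
The %timeit lines above only work inside IPython or Jupyter. As a rough sketch of the same comparison in a plain script, the standard-library timeit module can be used. The repeat count n_runs is an arbitrary assumption, and the absolute numbers will differ by machine and library version, so no timings are quoted here.

```python
import timeit

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

s = pd.Series([1, 2, 3, 4, 5] * 50000)
X = s.values.reshape(-1, 1)  # 2-D input required by OneHotEncoder
enc = OneHotEncoder()

n_runs = 5  # arbitrary choice for this sketch, not taken from the answer

# Average wall-clock seconds per call for each variant.
t_dummies = timeit.timeit(lambda: pd.get_dummies(s), number=n_runs) / n_runs
t_sparse = timeit.timeit(lambda: enc.fit_transform(X), number=n_runs) / n_runs
t_dense = timeit.timeit(lambda: enc.fit_transform(X).toarray(), number=n_runs) / n_runs

print(f"get_dummies:             {t_dummies * 1000:.1f} ms")
print(f"OneHotEncoder (sparse):  {t_sparse * 1000:.1f} ms")
print(f"OneHotEncoder (dense):   {t_dense * 1000:.1f} ms")
```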

3. Input data dependency

As explained in the old post

meW

I feel one of the key differences is the .transform method of OneHotEncoder.

If you plan to generate the same dummy variables on the test data, assume the following situation, where enc was fitted on training data containing all five categories:

enc.transform(pd.Series([1,3,2]).values.reshape(-1,1)).toarray()

# it creates the same columns as for the training data (all 5 features)
array([[1., 0., 0., 0., 0.],
       [0., 0., 1., 0., 0.],
       [0., 1., 0., 0., 0.]])

But pd.get_dummies works independently on the test data, so it creates columns only for the categories present there:

pd.get_dummies(pd.Series([1,3,2]))

# only 3 columns, one per category found in the test data
    1   2   3
0   1   0   0
1   0   0   1
2   0   1   0
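
To make the contrast concrete, here is a minimal sketch that fits the encoder on training data once and reuses it on test data. The second half shows one possible workaround for pd.get_dummies, reindexing the test columns to match the training columns; that workaround is an addition for illustration and not part of the answer above. The names train, test and train_columns are hypothetical, and handle_unknown="ignore" and dtype=int are optional choices assumed here.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.Series([1, 2, 3, 4, 5])
test = pd.Series([1, 3, 2])

# Fit once on the training data, then reuse the learned categories on test data.
# handle_unknown="ignore" (an optional setting) encodes unseen categories as all zeros.
enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(train.values.reshape(-1, 1))
test_encoded = enc.transform(test.values.reshape(-1, 1)).toarray()
print(test_encoded.shape)             # (3, 5)

# pd.get_dummies has no fitted state, so align the test columns manually:
# reindex to the training columns and fill the missing ones with 0.
train_columns = pd.get_dummies(train, dtype=int).columns
test_dummies = pd.get_dummies(test, dtype=int).reindex(columns=train_columns, fill_value=0)
print(test_dummies.columns.tolist())  # [1, 2, 3, 4, 5]
```

With the reindex step, both approaches produce the same 3 x 5 layout for this test data, but the column list has to be stored and applied by hand, whereas the fitted encoder carries it along.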
Venkatachalam