What is the difference between pd.get_dummies and sklearn's OneHotEncoder in Python? As far as I know, both do the same job. Can anyone explain the main difference between pd.get_dummies and sklearn's OneHotEncoder, and which one is currently more efficient?
Thanks for your prompt reply, but which of the two is more efficient? The only difference I can identify from that post is that pandas get_dummies can directly convert string columns into integer columns, whereas with OneHotEncoder we have to define the mapping explicitly before it can convert them. Is there any other difference? – Nandini Matam Mar 11 '19 at 10:37
2 Answers
4
1. Output difference
pd.get_dummies returns a pandas DataFrame, whereas OneHotEncoder returns a SciPy CSR sparse matrix by default.
Example -
import pandas as pd
s = pd.Series([1, 2, 3, 4, 5])
s
0    1
1    2
2    3
3    4
4    5
dtype: int64
type(pd.get_dummies(s))
pandas.core.frame.DataFrame
from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder()
# OneHotEncoder expects 2-D input, hence the reshape; call .toarray() on the result to get a NumPy ndarray
type(enc.fit_transform(s.values.reshape(-1, 1)))
scipy.sparse.csr.csr_matrix
2. Time complexity
In the benchmark below, pd.get_dummies is roughly two to three times faster than OneHotEncoder.
Example -
s = pd.Series([1, 2, 3, 4, 5]*50000)
len(s)
250000
%timeit pd.get_dummies(s)
15.2 ms ± 227 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit enc.fit_transform(s.values.reshape(-1, 1))
34.1 ms ± 5.26 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit enc.fit_transform(s.values.reshape(-1, 1)).toarray() # more reusable
45.3 ms ± 5.63 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
3. Input data dependency
As explained in the old post
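The older post is not reproduced here. As a rough illustration of one input-related point raised in the comments (string columns), here is a minimal sketch. It assumes scikit-learn 0.20 or later, where OneHotEncoder accepts string categories directly without a separate integer mapping. The `colors` frame is made-up data for illustration only.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical example data: a single string-valued categorical column.
colors = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# get_dummies encodes the string column directly; dtype=int gives 0/1 columns.
dummies = pd.get_dummies(colors["color"], dtype=int)

# Assumption: scikit-learn >= 0.20, where string input needs no manual mapping.
enc = OneHotEncoder()
encoded = enc.fit_transform(colors[["color"]]).toarray()

print(list(dummies.columns))   # column labels chosen by get_dummies
print(enc.categories_[0])      # categories learned by the encoder
```

Under that assumption both approaches order the categories alphabetically, so the two encodings should line up column for column.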

meW
Thanks for the clear explanation of my query. It was clean and clear. – Nandini Matam Mar 11 '19 at 10:51
1
I feel one of the key differences is the .transform method of OneHotEncoder.
If you plan to generate the same dummy variables for the test data, consider the following situation:
enc.transform(pd.Series([1,3,2]).values.reshape(-1,1)).toarray()
# it creates the same columns as for the training data (all 5 features)
array([[1., 0., 0., 0., 0.],
       [0., 0., 1., 0., 0.],
       [0., 1., 0., 0., 0.]])
pd.get_dummies, on the other hand, works on the test data independently, so it only creates columns for the categories present there:
pd.get_dummies(pd.Series([1,3,2]))
   1  2  3
0  1  0  0
1  0  0  1
2  0  1  0
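To make the comparison self-contained, here is a minimal sketch that fits the encoder on training data and then encodes the test data both ways. The `reindex` step is one possible workaround for aligning get_dummies output with the training columns; it is my assumption, not something the answer above proposes, and the variable names are illustrative.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.Series([1, 2, 3, 4, 5])
test = pd.Series([1, 3, 2])

# Fit on the training data only, then reuse the learned categories on the test data.
enc = OneHotEncoder()
enc.fit(train.values.reshape(-1, 1))
test_ohe = enc.transform(test.values.reshape(-1, 1)).toarray()

# get_dummies keeps no fitted state, so the test columns are aligned by hand:
# categories missing from the test data become all-zero columns.
train_dummies = pd.get_dummies(train, dtype=int)
test_dummies = pd.get_dummies(test, dtype=int).reindex(
    columns=train_dummies.columns, fill_value=0
)

print(test_ohe.shape)
print(test_dummies.shape)
```

With the manual alignment both results have five columns, but the encoder does this automatically, which is why it tends to fit better into a train/test workflow.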

Venkatachalam