
I'm using sklearn's OneHotEncoder and want to reverse the transformation to get my original data back. Any idea how to do that?

>>> from sklearn.preprocessing import OneHotEncoder
>>> enc = OneHotEncoder()
>>> enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]])  
>>> enc.n_values_
array([2, 3, 4])
>>> enc.feature_indices_
array([0, 2, 5, 9])
>>> enc.transform([[0, 1, 1]]).toarray()
array([[ 1.,  0.,  0.,  1.,  0.,  0.,  1.,  0.,  0.]])

I would like to be able to do something like the following:

>>> enc.untransform(array([[ 1.,  0.,  0.,  1.,  0.,  0.,  1.,  0.,  0.]]))
[[0, 1, 1]]

How would I go about doing this?

For context, I've built a neural network that learns in the one-hot encoded space, and I now want to use the network to make real predictions, which need to be in the original data format.

kmace
  • I notice that sklearn.feature_extraction.DictVectorizer has an inverse_transform method. – kmace Jun 08 '16 at 04:59
  • Just found this answer. It's quite elaborate, but it may help you: http://stackoverflow.com/questions/22548731/how-to-reverse-sklearn-onehotencoder-transform-to-recover-original-data – Guiem Bosch Jun 08 '16 at 05:58

1 Answer


To invert a single one-hot encoded feature (one column of values), see: https://stackoverflow.com/a/39686443/7671913

from sklearn.preprocessing import OneHotEncoder
import numpy as np

orig = np.array([6, 9, 8, 2, 5, 4, 5, 3, 3, 6])

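ohe = OneHotEncoder()
# The encoder expects 2-D input, so reshape the 1-D array into a single column.
encoded = ohe.fit_transform(orig.reshape(-1, 1))

# active_features_ maps each encoded column back to its original value.
# Note: this attribute was deprecated in scikit-learn 0.20, so this relies on an older release.
decoded = encoded.dot(ohe.active_features_).astype(int)
assert np.allclose(orig, decoded)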

To invert an array of one-hot encoded items with several features (as mentioned in the comments), see: How to reverse sklearn.OneHotEncoder transform to recover original data?

Given a sklearn.OneHotEncoder instance called ohc, the encoded data (a scipy.sparse.csr_matrix) returned by ohc.fit_transform or ohc.transform, called out, and the shape of the original data (n_samples, n_features), recover the original data X with:

# Parentheses let the expression span several lines without a syntax error.
recovered_X = (
    np.array([ohc.active_features_[col] for col in out.sorted_indices().indices])
    .reshape(n_samples, n_features)
    - ohc.feature_indices_[:-1]
)
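
As an illustration that is not part of the original answer, the same idea can be sketched with NumPy alone, using the block boundaries shown in the question's enc.feature_indices_ output ([0, 2, 5, 9]). The helper name invert_one_hot is hypothetical, and the sketch assumes a dense array and that each feature's categories are the integers 0..n-1, as in the question's data. Because it takes the argmax within each block, it should also cope with soft outputs such as a network's per-category scores.

```python
import numpy as np


def invert_one_hot(encoded, feature_indices):
    """Recover integer category codes from a dense one-hot (or soft) matrix.

    feature_indices holds the column boundaries of each feature's block,
    e.g. [0, 2, 5, 9] for three features with 2, 3 and 4 categories.
    Hypothetical helper; assumes categories are 0..n-1 within each feature.
    """
    encoded = np.asarray(encoded)
    columns = [
        # Position of the largest entry inside this feature's block.
        encoded[:, start:stop].argmax(axis=1)
        for start, stop in zip(feature_indices[:-1], feature_indices[1:])
    ]
    return np.stack(columns, axis=1)


encoded = np.array([[1., 0., 0., 1., 0., 0., 1., 0., 0.]])
decoded = invert_one_hot(encoded, [0, 2, 5, 9])
print(decoded)  # [[0 1 1]]
```

The boundaries are passed in explicitly so the helper does not depend on any particular scikit-learn version.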
bmjrowe
  • Since version 0.20 of scikit-learn, the `active_features_` attribute of the OneHotEncoder class has been deprecated. – gented Jan 20 '20 at 11:25
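
Following up on the deprecation note: scikit-learn 0.20 and later provide an inverse_transform method on OneHotEncoder, which may be the simplest route on a current install. A minimal sketch, assuming scikit-learn >= 0.20 and reusing the data from the question:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

X = np.array([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]])

enc = OneHotEncoder()
encoded = enc.fit_transform(X)  # sparse matrix by default

# inverse_transform maps one-hot rows back to the original categories.
recovered = enc.inverse_transform(encoded)
assert np.array_equal(recovered, X)

# The single encoded row from the question decodes the same way.
row = np.array([[1., 0., 0., 1., 0., 0., 1., 0., 0.]])
print(enc.inverse_transform(row))  # [[0 1 1]]
```

This avoids active_features_ and feature_indices_ altogether, so it does not depend on the attributes used in the answer above.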