How to Retrieve Original Variables After Scikit Model Run w/OneHotEncoding

Question

I have successfully run a logistic regression model with scikit-learn's SGDClassifier but cannot easily interpret the model's coefficients (accessed via SGDClassifier.coef_) because the input data was transformed with scikit-learn's OneHotEncoder.

My original input data X is of shape (12000,11):

X = np.array([[1,4,3...9,4,1],
              [5,9,2...3,1,4],
              ...
              [7,8,1...6,7,8]
              ])

I then applied one hot encoding:

from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder()
X_OHE = enc.fit_transform(X).toarray()

which produces an array of shape (12000, 696):

X_OHE = np.array([[1,0,1...0,0,1],
                 [0,0,0...0,1,0],
                  ...
                 [1,0,1...0,0,1]
                 ])

I then access the model's coefficients with SGDClassifier.coef_ which produces an array of shape (1,696):

coefs = np.array([[-1.233e+00,0.9123e+00,-2.431e+00...-0.238e+01,-1.33e+00,0.001e-01]])

How do I map the coefficient values back to the original values in X, so I can say something like, "if variable foo has a value of bar, the target variable increases/decreases by bar_coeff"?

Let me know if you need more info on the data or the model parameters. Thank you.

I found one unanswered question about this on SO: How to retrieve coefficient names after label encoding and one hot encoding on scikit-learn?

Let's say a single feature in the original data is converted into 4 features in the one-hot encoded data. All 4 of these features will have different coefficients, so how do you plan to combine them into the original feature? — Vivek Kumar, Jul 12 '17 at 00:57
I think you've just asked the same question that I posted above. — NickBraunagel, Jul 12 '17 at 01:01
Yes, that's what I was trying to say. It's not available in the library because there is no fixed way to interpret the results. The other question you linked says that it was asked on Cross-validated but referred here. I would advise you to ask this again on Cross-validated, but not in the current question form. Instead, remove the libraries and programming and just describe what you want to do and what the best practices for it may be. If and when you get a satisfactory way to interpret the one-hot encoded coefficients, try programming it. Hope I am clear. — Vivek Kumar, Jul 12 '17 at 01:05

score 1 · Answer 1 · answered Jul 15 '17 at 03:36

After reviewing this user's detailed explanation of OneHotEncoder here, I was able to create a (somewhat hack-y) approach to relating model coefficients back to the original data set.

Assuming you've correctly set up your OneHotEncoder:

from sklearn.preprocessing import OneHotEncoder
from scipy import sparse

enc = OneHotEncoder()
X_OHE = enc.fit_transform(X)   # X and X_OHE as described in question

And you have successfully fit a GLM, say:

from sklearn import linear_model

clf = linear_model.SGDClassifier()
# X_train is assumed to be a training split of X_OHE, y_train the matching targets
clf.fit(X_train, y_train)

Which has coefficients clf.coef_:

print(clf.coef_)
# np.array([[-1.233e+00,0.9123e+00,-2.431e+00...-0.238e+01,-1.33e+00,0.001e-01]])

You can use the approach below to trace the encoded 1's and 0's in X_OHE back to the original values in X. I'd recommend reading the detailed explanation of OneHotEncoder mentioned above (link at top), or the code below will seem like gibberish. In a nutshell, it iterates over each feature in X_OHE and uses the encoder's fitted feature_indices_ and active_features_ attributes to make the translation. Note that these attributes exist only in older scikit-learn releases: they were deprecated in 0.20 and removed in 0.22.

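import numpy as np
import pandas as pd

results = []

# active_features_[i] is the position of column i of X_OHE in the full (unmasked) encoding
for i, f in enumerate(enc.active_features_):
    # start offsets of every original feature whose block begins at or before f
    index_range = np.extract(enc.feature_indices_ <= f, enc.feature_indices_)
    s = len(index_range) - 1       # index of the originating column in X (not stored below)
    f_index = index_range[-1]      # start offset of that column's block
    f_label_decoded = f - f_index  # original integer value in column s of X

    results.append({
        'label_decoded_value': f_label_decoded,
        'coefficient': clf.coef_[0][i]
    })

R = pd.DataFrame.from_records(results)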

Where R looks like this (I originally encoded the names of company departments):

coefficient label_decoded_value
3.929413    DepartmentFoo1
3.718078    DepartmentFoo2
3.101869    DepartmentFoo3
2.892845    DepartmentFoo4
...

So now you can say, "The model's score for the target increases by 3.929413 when an employee is in department 'Foo1'" (for a logistic loss, that score is the log-odds of the positive class).
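
On newer scikit-learn releases, where `active_features_` and `feature_indices_` no longer exist, the same mapping can probably be rebuilt from the encoder's `categories_` attribute instead. The following is a minimal sketch under that assumption, not the method used above; the toy `X`, `y`, and the column names `foo` and `bar` are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: two integer-coded categorical columns and a binary target.
X = np.array([[1, 4], [5, 9], [7, 4], [1, 9], [5, 4], [7, 9]])
y = np.array([0, 1, 1, 0, 1, 0])
columns = ["foo", "bar"]  # hypothetical names for the original columns

enc = OneHotEncoder()
X_OHE = enc.fit_transform(X)

clf = SGDClassifier(random_state=0)
clf.fit(X_OHE, y)

# enc.categories_ holds one sorted array of observed values per original column;
# the encoded columns follow that same order, column by column.
records = [
    {"variable": name, "value": value}
    for name, cats in zip(columns, enc.categories_)
    for value in cats
]
R = pd.DataFrame.from_records(records)
R["coefficient"] = clf.coef_[0]  # one coefficient per encoded column
print(R)
```

Each row of `R` then pairs one encoded column with the original variable and value it stands for. On scikit-learn 1.0 and later, `enc.get_feature_names_out(columns)` should give equivalent labels in the same order.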

thanks, @NickBraunagel, and did you adapt it for the case when you have encoded **multiple categorical variables** ? — Brigitte Maillère, Nov 18 '18 at 10:13
@BrigitteCharpent - it’s been a while since writing the above code, but I believe it will handle anything that you’ve encoded, as it’s looping over the active features via `enc.active_features_[i]`. — NickBraunagel, Nov 18 '18 at 23:42
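
On the multiple-variable case raised in the comments: the index arithmetic in the answer's loop can be checked by hand without a fitted encoder. The sketch below uses hand-built stand-ins for the legacy attributes (assumed values, not output from a real encoder), for two columns where column 0 takes the values {1, 5, 7} and column 1 takes {4, 9}. With the legacy encoder's automatic sizing, each column gets a block of max value + 1 slots, so the block offsets would be 0, 8, and 18.

```python
import numpy as np

# Hand-built stand-ins for the legacy attributes (assumed values for illustration).
feature_indices = np.array([0, 8, 18])          # start offset of each column's block
active_features = np.array([1, 5, 7, 12, 17])   # slots actually seen during fitting

# Index of the block each active slot falls into, i.e. the original column.
original_column = np.searchsorted(feature_indices, active_features, side="right") - 1

# Offset within that block, i.e. the original integer value.
decoded_value = active_features - feature_indices[original_column]

for col, val in zip(original_column, decoded_value):
    print(f"column {col}, value {val}")
```

Here `original_column` plays the role of `s` in the answer's loop, which is computed there but not stored. Keeping it alongside each coefficient is what tells the variables apart when more than one has been encoded.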
