
I am using sklearn's Pipeline to one-hot encode and to model, almost exactly as in this post.

After using a Pipeline, I am no longer able to get tree contributions. I get this error:

AttributeError: 'Pipeline' object has no attribute 'n_outputs_'

I tried playing around with the parameters of treeinterpreter, but I am stuck.

Hence my question: is there any way to get the contributions out of a tree when using sklearn's Pipeline?

EDIT 2 - Real data as requested by Venkatachalam:

# Data DF to train the model
import pandas as pd

df = pd.DataFrame(
    [['SGOHC', 'd', 'onetwothree', 'BAN', 488.0580347, 960, 841, 82, 0.902497027, 841, 0.548155625, 0.001078211, 0.123958333, 1],
     ['ABCDEFGHIJK', 'SOC', 'CON', 'CAN', 680.84, 1638, 0, 0, 0, 0, 3.011140743, 0.007244358, 1, 0],
     ['Hello', 'AA', 'onetwothree', 'SPEAKER', 5823.230967, 2633, 1494, 338, 0.773761714, 1494, 12.70144386, 0.005743015, 0.432586403, 8]],
    columns=['B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'target'])

# Create train and test sets (not strictly needed, but for the example...)
from sklearn.model_selection  import train_test_split

# Define X and y 
X = df.drop('target', axis=1)
y = df['target']

# Create Train and Test Sets 
X_train, X_validation, Y_train, Y_validation = train_test_split(X, y, test_size=0.20, random_state=1)


# Make the pipeline and model
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestRegressor

# Note: ColumnTransformer defaults to remainder='drop', so only the
# one-hot-encoded column (index 1, i.e. 'C') reaches the model.
rfr = Pipeline([('preprocess',
                 ColumnTransformer([('ohe',
                                     OneHotEncoder(handle_unknown='ignore'), [1])])),
                ('rf', RandomForestRegressor())])

rfr.fit(X_train, Y_train)


# The new, real data that we need to predict & explain!
new_data = pd.DataFrame(
    [['DEBTYIPL', 'de', 'onetwothreefour', 'BANAAN', 4848.0580347, 923460, 823441, 5, 0.902497027, 43, 0.548155625, 0.001078211, 0.123958333],
     ['ABCDEFGHIJK', 'SOC', 'CON', 'CAN23', 680.84, 1638, 0, 0, 0, 0, 1.011140743, 4.007244358, 1],
     ['Hello_NO', 'AAAAa', 'onetwothree', 'SPEAKER', 5823.230967, 123, 32, 22, 0.773761714, 1678, 12.70144386, 0.005743015, 0.432586403]],
    columns=['B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N'])
new_data.head()

# Predicting the values 
rfr.predict(new_data)

# Now the error... the contributions: 
from treeinterpreter import treeinterpreter as ti
prediction, bias, contributions = ti.predict(rfr[-1], rfr[:-1].fit_transform(new_data))

#ValueError: Number of features of the model must match the input. Model n_features is 2 and input n_features is 3 
R overflow
  • 1,292
  • 2
  • 17
  • 37

2 Answers


You can get the final estimator by indexing the pipeline object with model[-1]. Similarly, you can get a new pipeline that captures all the transformation steps but excludes the final estimator with model[:-1].

Hence, this is what you need to do:

prediction, bias, contributions = ti.predict(model[-1], model[:-1].transform(df))
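
Applied to the pipeline from the question, that looks like this (a minimal sketch; it assumes rfr has already been fitted on the training data):

from treeinterpreter import treeinterpreter as ti

# rfr[-1] is the fitted RandomForestRegressor; rfr[:-1] is a sub-pipeline
# containing only the (already fitted) preprocessing steps.
prediction, bias, contributions = ti.predict(
    rfr[-1],
    rfr[:-1].transform(new_data)  # .transform, not .fit_transform (see comments below)
)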
Venkatachalam
  • thanks again. It works on the dummy data, but if I run this on my real data (with many new unique values in the test data), I get the next error: ValueError: Number of features of the model must match the input. Model n_features is 108 and input n_features is 90. Could it be the case that, within the Pipeline, we are missing the skip part, as you mentioned in your previous answer [here](https://stackoverflow.com/questions/64910582/can-we-make-the-ml-model-pickle-file-more-robust-by-accepting-or-ignoring-n)? – R overflow Nov 30 '20 at 10:40
  • no, it should work for any number of unique values as well. Can you provide a reproducible example? – Venkatachalam Nov 30 '20 at 10:55
  • Sorry for my late response (it took quite some time to generate a reproducible example). Please see EDIT 2. Thanks a lot @Venkatachalam – R overflow Nov 30 '20 at 13:57
  • Found the mistake in my solution. It has to be `.transform` instead of `.fit_transform`. This means it will reuse the categories it learned during the training phase instead of learning them afresh from the given data (see the sketch after these comments). – Venkatachalam Dec 03 '20 at 07:33
  • Brilliant! Bounty added. Maybe one last question (hope you can help): after getting the contributions, I try to put them into a DataFrame: contributions_df = pd.DataFrame(data=contributions, columns=df.columns), and now I have the same issue here (df.columns does not match the 'transformed' df). Do you know how I can get the df into the right shape too? The error is: ValueError: Shape of passed values is (20277, 108), indices imply (20277, 14) – R overflow Dec 03 '20 at 12:53
  • I think the new column names can be fetched using something like [this](https://stackoverflow.com/a/54648023/6347629); see the second sketch below. – Venkatachalam Dec 11 '20 at 08:41
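
A small sketch of why `.transform` matters here (hypothetical toy data, with handle_unknown='ignore' as in the question's pipeline):

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({'C': ['d', 'SOC', 'AA']})
new = pd.DataFrame({'C': ['de', 'SOC', 'AAAAa']})

ohe = OneHotEncoder(handle_unknown='ignore')
ohe.fit(train)
print(ohe.categories_)           # [array(['AA', 'SOC', 'd'], dtype=object)]
print(ohe.transform(new).shape)  # (3, 3): width matches the training categories

# Calling fit_transform on the new data re-learns the categories instead,
# so the columns no longer line up with what the model was trained on
# (hence errors like "Model n_features is 108 and input n_features is 90").
ohe2 = OneHotEncoder(handle_unknown='ignore')
ohe2.fit_transform(new)
print(ohe2.categories_)          # [array(['AAAAa', 'SOC', 'de'], dtype=object)]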
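
For the follow-up question about naming the contribution columns, something along these lines should work (a sketch; get_feature_names_out requires scikit-learn >= 1.0, older versions use get_feature_names on the ColumnTransformer):

import pandas as pd

# Name the contribution columns after the transformed (one-hot) features,
# not after the original df columns.
feature_names = rfr.named_steps['preprocess'].get_feature_names_out()
contributions_df = pd.DataFrame(contributions, columns=feature_names)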

To access the Pipeline's fitted model, just retrieve the ._final_estimator attribute from your pipeline:

from treeinterpreter import treeinterpreter as ti
# note: fit_transform re-fits the encoder on df, so the learned categories
# may differ from those seen during training; see the comments below
prediction, bias, contributions = ti.predict(model._final_estimator, model[0].fit_transform(df))

Notice that one can verify whether the estimator is fitted by calling the sklearn utility check_is_fitted on it:

from sklearn.utils.validation import check_is_fitted
check_is_fitted(model._final_estimator)
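
For example, a small sketch of the failure mode on an unfitted estimator:

from sklearn.ensemble import RandomForestRegressor
from sklearn.exceptions import NotFittedError
from sklearn.utils.validation import check_is_fitted

try:
    check_is_fitted(RandomForestRegressor())  # not fitted yet
except NotFittedError as err:
    print(err)  # prints a "not fitted yet" message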
Miguel Trejo
  • Thanks @Miguel. I tried your solution with the example above, but it results in: ValueError: could not convert string to float: 'female' – R overflow Nov 30 '20 at 10:35
  • yes, the One Hot Encoder step is missing on `df`; `model[0].fit_transform(df)` should work – Miguel Trejo Nov 30 '20 at 17:00
  • Appreciated @Miguel!! Not sure why, but it also fails on the real data (please see EDIT 2). Any ideas? – R overflow Dec 01 '20 at 08:05
  • considering `rfr[:-1].fit_transform(X_train)` and `rfr[:-1].fit_transform(df)`, notice that there is a category in `df` that `X_train` does not have; for example, `df` has `d, SOC, AA` in column `C`, while `X_train` has only `SOC, AA`. Thus, when splitting your data into train and test sets, just make sure both sets contain the same categories. – Miguel Trejo Dec 01 '20 at 15:39
  • thanks! That is why I used the pipeline functionality (with the ignore parameter; see the sketch after these comments), to be able to handle 'real world' data, which may contain new classes. You can find that example [here](https://stackoverflow.com/questions/64910582/can-we-make-the-ml-model-pickle-file-more-robust-by-accepting-or-ignoring-n/64964450?noredirect=1#comment114860151_64964450). That caused the issue that I am not able to get the tree contributions out of the pipeline's model (the pipeline does the encoding and the model does the fitting). Hope that you can help me – R overflow Dec 02 '20 at 14:08
  • @Roverflow if you know that you are going to receive only one additional category in the real data, you can apply `.add_dummy_feature` to your train data with a value of `-1` to indicate that it is unknown. However, if you expect several new categories in the real data, I suggest using another encoding type such as mean encoding; a complete reference for other encoding types can be found [here](https://towardsdatascience.com/all-about-categorical-variable-encoding-305f3361fd02). – Miguel Trejo Dec 02 '20 at 17:11
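
As a side note on the handle_unknown='ignore' route mentioned above, a hypothetical toy sketch of its behavior:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# With handle_unknown='ignore', categories unseen during fit are encoded
# as all-zero rows instead of raising an error at transform time.
ohe = OneHotEncoder(handle_unknown='ignore')
ohe.fit(pd.DataFrame({'C': ['SOC', 'AA']}))
print(ohe.transform(pd.DataFrame({'C': ['SOC', 'NEW']})).toarray())
# [[0. 1.]
#  [0. 0.]]  <- 'NEW' maps to all zeros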