
I am using RFECV for feature selection in scikit-learn. I would like to compare the result of a simple linear model (X, y) with that of a log-transformed model (X, log(y)).

Simple Model: RFECV and cross_val_score give the same result (we compare the average cross-validation score across all folds with the RFECV score for all features: 0.66 = 0.66, so no problem; the results are reliable).

Log Model: the problem: it seems that RFECV does not provide a way to transform y. The scores in this case are 0.55 vs 0.53. This is expected, though, because I had to apply np.log manually before fitting: log_selector = log_selector.fit(X, np.log(y)). That r2 score is computed on y = log(y), with no inverse_func, whereas what we need is a way to fit the model on log(y_train) and then score the back-transformed predictions (exp) against the original-scale y_test. Alternatively, if I try to use TransformedTargetRegressor, I get the error shown in the code below: RuntimeError: The classifier does not expose "coef_" or "feature_importances_" attributes.
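To make this concrete, here is a minimal sketch of the scoring behaviour I am after (log_model_cv_score is just an illustrative name, not part of my code): fit each fold on log(y_train), back-transform the predictions with exp, and score against the untransformed y_test. This is what TransformedTargetRegressor does via its inverse_func:

from sklearn.model_selection import KFold
from sklearn.metrics import r2_score
from sklearn.linear_model import LinearRegression
import numpy as np

def log_model_cv_score(X, y, n_splits=5):
    # fit each fold on log(y_train), then score exp(predictions)
    # against the original-scale y_test
    scores = []
    for train_idx, test_idx in KFold(n_splits=n_splits).split(X):
        model = LinearRegression().fit(X[train_idx], np.log(y[train_idx]))
        pred = np.exp(model.predict(X[test_idx]))
        scores.append(r2_score(y[test_idx], pred))
    return np.mean(scores)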

How do I resolve the problem and make sure that the feature selection process is reliable?

from sklearn.datasets import make_friedman1
from sklearn.feature_selection import RFECV
from sklearn import linear_model
from sklearn.model_selection import cross_val_score
from sklearn.compose import TransformedTargetRegressor
import numpy as np

X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)
estimator = linear_model.LinearRegression()
log_estimator = TransformedTargetRegressor(regressor=linear_model.LinearRegression(),
                                                func=np.log,
                                                inverse_func=np.exp)
selector = RFECV(estimator, step=1, cv=5, scoring='r2')
selector = selector.fit(X, y)
###
# log_selector = RFECV(log_estimator, step=1, cv=5, scoring='r2')
# log_selector = log_selector.fit(X, y)
# # RuntimeError: The classifier does not expose "coef_" or "feature_importances_" attributes
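# (RFECV checks the fitted estimator itself for coef_/feature_importances_;
# TransformedTargetRegressor keeps the fitted model in regressor_ and does not
# re-expose its coef_, hence the error)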
###
log_selector = RFECV(estimator, step=1, cv=5, scoring='r2')
log_selector = log_selector.fit(X, np.log(y))

print("**Simple Model**")
print("RFECV, r2 scores: ", np.round(selector.grid_scores_,2))
scores = cross_val_score(estimator, X, y, cv=5)
print("cross_val, mean r2 score: ", round(np.mean(scores),2), ", same as RFECV score with all features") 
print("no of feat: ", selector.n_features_ )

print("**Log Model**")
log_scores = cross_val_score(log_estimator, X, y, cv=5)
print("RFECV, r2 scores: ", np.round(log_selector.grid_scores_,2))
print("cross_val, mean r2 score: ", round(np.mean(log_scores),2)) 
print("no of feat: ", log_selector.n_features_ )

Output:

**Simple Model**
RFECV, r2 scores:  [0.45 0.6  0.63 0.68 0.68 0.69 0.68 0.67 0.66 0.66]
cross_val, mean r2 score:  0.66 , same as RFECV score with all features
no of feat:  6

**Log Model**
RFECV, r2 scores:  [0.39 0.5  0.59 0.56 0.55 0.54 0.53 0.53 0.53 0.53]
cross_val, mean r2 score:  0.55
no of feat:  3
– towi_parallelism

2 Answers


All you need to do is add these properties in a subclass of TransformedTargetRegressor:

class MyTransformedTargetRegressor(TransformedTargetRegressor):
    @property
    def feature_importances_(self):
        return self.regressor_.feature_importances_

    @property
    def coef_(self):
        return self.regressor_.coef_

Then, in your code, use it:

log_estimator = MyTransformedTargetRegressor(regressor=linear_model.LinearRegression(),
                                             func=np.log,
                                             inverse_func=np.exp)
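
With the attributes exposed, the commented-out lines from your question run; a minimal sketch (note there is no manual np.log, and the r2 scores are computed against the original-scale y):

log_selector = RFECV(log_estimator, step=1, cv=5, scoring='r2')
log_selector = log_selector.fit(X, y)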
– Computer_guy

One workaround for this problem is to make sure the coef_ attribute is exposed to the feature selection module, RFECV. Basically, you need to extend TransformedTargetRegressor and make sure it exposes the attribute coef_. I have created a child class that extends TransformedTargetRegressor and exposes coef_, as shown below.

from sklearn.linear_model import LinearRegression
from sklearn.datasets import make_friedman1
from sklearn.feature_selection import RFECV
from sklearn import linear_model
from sklearn.model_selection import cross_val_score
from sklearn.compose import TransformedTargetRegressor
import numpy as np

class myestimator(TransformedTargetRegressor):
    # No custom __init__ is needed: TransformedTargetRegressor already accepts
    # regressor, func and inverse_func, and keeping the parent's signature
    # lets RFECV clone the estimator with its parameters intact.

    def fit(self, X, y, **fit_params):
        super().fit(X, y, **fit_params)
        # expose the fitted inner regressor's coefficients so RFECV can rank features
        self.coef_ = self.regressor_.coef_
        return self

Then you can use myestimator in your code, as shown below:

X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)
estimator = linear_model.LinearRegression()
log_estimator = myestimator(regressor=LinearRegression(),func=np.log,inverse_func=np.exp)

selector = RFECV(estimator, step=1, cv=5, scoring='r2')
selector = selector.fit(X, y)
log_selector = RFECV(log_estimator, step=1, cv=5, scoring='r2')
log_selector = log_selector.fit(X, y)

I have run your sample code; the result is shown below. Note that the log-model scores differ from the ones in your question because RFECV now scores the back-transformed (exp) predictions against the original-scale y, instead of scoring in log space.

SAMPLE OUTPUT

print("**Simple Model**")
print("RFECV, r2 scores: ", np.round(selector.grid_scores_,2))
scores = cross_val_score(estimator, X, y, cv=5)
print("cross_val, mean r2 score: ", round(np.mean(scores),2), ", same as RFECV score with all features") 
print("no of feat: ", selector.n_features_ )

print("**Log Model**")
log_scores = cross_val_score(log_estimator, X, y, cv=5)
print("RFECV, r2 scores: ", np.round(log_selector.grid_scores_,2))
print("cross_val, mean r2 score: ", round(np.mean(log_scores),2)) 
print("no of feat: ", log_selector.n_features_ )


**Simple Model**
RFECV, r2 scores:  [0.45 0.6  0.63 0.68 0.68 0.69 0.68 0.67 0.66 0.66]
cross_val, mean r2 score:  0.66 , same as RFECV score with all features
no of feat:  6
**Log Model**
RFECV, r2 scores:  [0.41 0.51 0.59 0.59 0.58 0.56 0.54 0.53 0.55 0.55]
cross_val, mean r2 score:  0.55
no of feat:  4
– Parthasarathy Subburaj
  • Thanks! It seems to be a way to resolve it, but how would I import the modified code? RFECV itself uses RFE, so things become a bit messy. Plus, your sample output is just a sample, I assume, right? You did not actually run the code? Because the numbers do not make sense: the second one should give me 0.55 at the end and not 0.66. – towi_parallelism Oct 07 '19 at 21:12
  • Instead of modifying multiple files, I have made it simpler now! You just need to replace your `rfe.py` file in your local system (it can be found at /sklearn/feature_selection/rfe.py) with the file that I have posted in the github gist. – Parthasarathy Subburaj Oct 08 '19 at 08:45
  • Thanks. That could work, but it is not ideal, as it will make my code non-portable and non-reproducible. Can anything else be done without touching the scikit-learn source? – towi_parallelism Oct 08 '19 at 12:27
  • Yeah, that's a valid point. I made the necessary changes to the code, and now you don't have to worry about touching the source code; it's easily portable and reproducible. – Parthasarathy Subburaj Oct 08 '19 at 14:50
  • Thanks @Parthasarathy. I got two correct answers. Will upvote for now. – towi_parallelism Oct 11 '19 at 18:03