
My question refers to a problem which has been raised in the following similar unanswered question: Using a Pipeline containing ColumnTransformer in SciKit's RFECV

I am trying to select the most relevant features with RFECV, using a pipeline containing a ColumnTransformer, with the following code:

from sklearn.feature_selection import RFECV
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import PowerTransformer
from sklearn.compose import ColumnTransformer
from sklearn.compose import TransformedTargetRegressor
from sklearn.pipeline import Pipeline
from sklearn.linear_model import HuberRegressor as hr

ind_nums_area = list(extra_enc_area_tr_data.columns.get_indexer(nums_area))

huber_prep_pipe = Pipeline([
    ('scaler', StandardScaler())
])
huber_col_transf = ColumnTransformer([
    ('prep', huber_prep_pipe, ind_nums_area)
], remainder='passthrough')

huber_pipe = Pipeline([
    ('transf', huber_col_transf),
    ('est', hr())
])
huber_ttr = TransformedTargetRegressor(regressor=huber_pipe, transformer=MinMaxScaler())
min_features = 100
huber_pipe_rfecv = RFECV(huber_ttr, min_features_to_select=min_features,
                         cv=5, scoring='neg_root_mean_squared_error',
                         n_jobs=-1, verbose=3,
                         importance_getter='regressor_.named_steps.est.coef_')
huber_pipe_rfecv.fit(extra_enc_area_tr_data, log_tr_target)

ind_nums_area is a list of indices of the features to be transformed by the ColumnTransformer (the list of feature names is the nums_area variable). I am using indices because RFECV converts the training dataset, a pandas dataframe, to a numpy array, so column names are of course not allowed. Unfortunately, indices are not the way to go either: RFECV reduces the number of features, so the listed indices no longer match. One of the features to be transformed by the ColumnTransformer has index 255, the last feature of the training data, and in that case I get the errors IndexError: index 255 is out of bounds for axis 0 with size 255 and ValueError: all features must be in [0, 254] or [-255, 0].
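The failure is easy to reproduce in isolation (a minimal sketch with hypothetical shapes, not my actual data): as soon as the array handed to the ColumnTransformer has fewer columns than the fixed index expects, exactly this error is raised:

import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

ct = ColumnTransformer([
    ('prep', StandardScaler(), [255])
], remainder='passthrough')

# one of the original 256 columns has already been dropped by RFECV
X_reduced = np.zeros((10, 255))
ct.fit_transform(X_reduced)  # ValueError: all features must be in [0, 254] or [-255, 0]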

Do you have any suggestion how to solve this issue? Is there any simple way to find out which feature has been removed in each iteration of RFECV and adjust the indices accordingly? Maybe you have a suggestion how to avoid the conversion of the dataframe to a numpy array by RFECV? Otherwise, do you know some alternative to sklearn's RFECV that does not convert the training dataframe to a numpy array?

I do not want to transform all the data before passing it to the estimator, as this would leak the scaler information to the test folds.

How to deal with this without data leakage?


2 Answers


I had been trying to solve this problem in a similar setting for some time, until I got fed up with the complicated internal conversions of scikit-learn and decided to write my own rfecv that works with pipelined transformers.

Basically, I implemented the algorithm from Guyon, Isabelle, et al., "Gene selection for cancer classification using support vector machines," Machine Learning 46.1 (2002): 389-422, which is also the basis for the scikit-learn implementation.

from time import time

import numpy as np
from sklearn.model_selection import KFold, cross_validate
from sklearn.pipeline import Pipeline

def rfecv(X, y, estimator,
          min_features_to_select=3, 
          splits=3,
          step=3,
          scoring_metric="f1",
          scoring_decimals=3,
          random_state=None):
    """
    This method is an implementation of the recursive feature elimination
    algorithm, which eliminates unnecessary features. As scikit-learn only
    provides an RFECV version [1] that makes using Pipelines very difficult,
    we have implemented our own version based on the original paper [2].
    
    [1] https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFECV.html
    [2] Guyon, Isabelle, et al. "Gene selection for cancer classification using support vector machines." 
        Machine learning 46.1 (2002): 389-422.

    :X: a DataFrame containing the features.
    :y: a Series containing the labels.
    :estimator: a scikit-learn estimator or a Pipeline. If a pipeline is passed,
        the last element of the pipeline is assumed to be a classifier providing
        a feature_importances_ attribute.
    :min_features_to_select: the minimum number of features to evaluate.
    :splits: number of splits to use for cross-validation.
    :step: the amount of features to be reduced during each step.
    :scoring_metric: the scoring metric to use for evaluation (e.g., "f1" or 
        a callable implementing the sklearn scoring interface).
    :scoring_decimals: the scoring metric can be rounded to N decimals to avoid 
        the reduction from getting stuck with a larger number of features with
        very small score gains. Defaults to 3 digits. If None is passed, full
        scoring precision is used.
    :random_state: if not None, this is the seed for all RNGs used in this function.
        
    :returns: best_features, best_score, ranks, scores; best_features is a list
        of features, best_score is the mean score achieved with these features over the
        folds, ranks is the order of eliminated features (from most relevant to most irrelevant),
        scores is the list of mean scores for each step achieved during the feature 
        elimination across all folds.
    """
    # Initialize survivors and ranked list
    survivors = list(X.columns)
    ranks = []
    scores = []
    
    # While the survivor list is longer than min_features_to_select
    while len(survivors) >= min_features_to_select:
                
        # Get only the surviving features
        X_tmp = X[survivors]
        
        # Train and get the scores, cross_validate clones 
        # the model internally, so this does not modify
        # the estimator passed to this function
        print("[%.2f] evaluating %i features ..." % (time(), len(X_tmp.columns)))
        cv_result = cross_validate(estimator, X_tmp, y,
                                   cv=KFold(n_splits=splits, 
                                            shuffle=True, 
                                            random_state=random_state),
                                   scoring=scoring_metric,
                                   return_estimator=True)
        
        # Append the mean performance to the scores list
        score = np.mean(cv_result["test_score"])
        if scoring_decimals is None:
            scores.append(score)
        else:
            scores.append(round(score, scoring_decimals))            
        print("[%.2f] ... score %f." % (time(), scores[-1]))
        
        # Get feature weights from the model fitted 
        # on the best fold and square the weights as described 
        # in the paper. If the estimator is a Pipeline,
        # we get the weights from the last element.
        best_estimator = cv_result["estimator"][np.argmax(cv_result["test_score"])]
        if isinstance(best_estimator, Pipeline):
            weights = best_estimator[-1].feature_importances_
        else:
            weights = best_estimator.feature_importances_
        weights = list(np.power(weights, 2))
                
        # Remove step features (but respect min_features_to_select)
        for _ in range(max(min(step, len(survivors) - min_features_to_select), 1)):
            
            # Find the feature with the smallest ranking criterion
            # and update the ranks and survivors
            idx = np.argmin(weights)
            ranks.insert(0, survivors.pop(idx))
            weights.pop(idx)
            
    # Calculate the best set of surviving features
    ranks_reverse = list(reversed(ranks))
    last_max_idx = len(scores) - np.argmax(list(reversed(scores))) - 1
    removed_features = set(ranks_reverse[0:last_max_idx * step])
    best_features = [f for f in X.columns if f not in removed_features]
    
    # Return ranks and scores
    return best_features, max(scores), ranks, scores

Everything you need to know is documented in the docstring. The only exception is how to interpret the returned ranks and scores lists. In the case of step being 1, the score at scores[i] was achieved by removing all the features in list(reversed(ranks))[0:i] (as ranks is the list of removed features, ordered from most relevant to most irrelevant).
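For example, a small sketch reusing the returned values (i is an arbitrary step index, X the dataframe passed in):

# reconstruct the feature set that produced scores[i] (for step=1)
i = 5
removed_so_far = list(reversed(ranks))[0:i]
features_at_step_i = [f for f in X.columns if f not in removed_so_far]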

A minimal working example with a DecisionTree is shown below (it of course also works with Pipelines and transformers, provided the classifier is the last element of the Pipeline):

Python 3.9.1 (default, Dec 11 2020, 14:32:07) 
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from sklearn.datasets import load_breast_cancer
>>> from sklearn.tree import DecisionTreeClassifier
>>> test_data = load_breast_cancer(as_frame=True)
>>> clf = DecisionTreeClassifier(random_state=0)
>>> clf.fit(test_data.data, test_data.target)
DecisionTreeClassifier(random_state=0)
>>> best_features, best_score, _, _ = rfecv(test_data.data, test_data.target, clf, step=1, min_features_to_select=1, random_state=0)
[1626774242.35] evaluating 30 features ...
[1626774242.38] ... score 0.944000.
[1626774242.38] evaluating 29 features ...
[1626774242.42] ... score 0.938000.
[1626774242.42] evaluating 28 features ...
[1626774242.47] ... score 0.948000.
[1626774242.47] evaluating 27 features ...
[1626774242.50] ... score 0.934000.
[1626774242.51] evaluating 26 features ...
[1626774242.54] ... score 0.938000.
[1626774242.54] evaluating 25 features ...
[1626774242.58] ... score 0.939000.
[1626774242.58] evaluating 24 features ...
[1626774242.62] ... score 0.941000.
[1626774242.62] evaluating 23 features ...
[1626774242.65] ... score 0.944000.
[1626774242.65] evaluating 22 features ...
[1626774242.68] ... score 0.953000.
[1626774242.68] evaluating 21 features ...
[1626774242.70] ... score 0.940000.
[1626774242.70] evaluating 20 features ...
[1626774242.72] ... score 0.941000.
[1626774242.72] evaluating 19 features ...
[1626774242.75] ... score 0.943000.
[1626774242.75] evaluating 18 features ...
[1626774242.77] ... score 0.942000.
[1626774242.77] evaluating 17 features ...
[1626774242.79] ... score 0.944000.
[1626774242.79] evaluating 16 features ...
[1626774242.80] ... score 0.945000.
[1626774242.80] evaluating 15 features ...
[1626774242.82] ... score 0.935000.
[1626774242.82] evaluating 14 features ...
[1626774242.84] ... score 0.935000.
[1626774242.84] evaluating 13 features ...
[1626774242.86] ... score 0.947000.
[1626774242.86] evaluating 12 features ...
[1626774242.87] ... score 0.950000.
[1626774242.87] evaluating 11 features ...
[1626774242.89] ... score 0.950000.
[1626774242.89] evaluating 10 features ...
[1626774242.91] ... score 0.944000.
[1626774242.91] evaluating 9 features ...
[1626774242.92] ... score 0.948000.
[1626774242.92] evaluating 8 features ...
[1626774242.94] ... score 0.953000.
[1626774242.94] evaluating 7 features ...
[1626774242.95] ... score 0.953000.
[1626774242.95] evaluating 6 features ...
[1626774242.97] ... score 0.949000.
[1626774242.97] evaluating 5 features ...
[1626774242.98] ... score 0.951000.
[1626774242.98] evaluating 4 features ...
[1626774243.00] ... score 0.947000.
[1626774243.00] evaluating 3 features ...
[1626774243.01] ... score 0.950000.
[1626774243.01] evaluating 2 features ...
[1626774243.02] ... score 0.942000.
[1626774243.02] evaluating 1 features ...
[1626774243.03] ... score 0.911000.
>>> print(best_features, best_score)
['area error', 'smoothness error', 'fractal dimension error', 'worst radius', 'worst texture', 'worst concavity', 'worst concave points'] 0.953
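For the pipelined case, a sketch along the following lines should work (my assumption, not part of the tested example above). Using make_column_selector instead of fixed indices sidesteps the stale-index problem from the question, because the selector is re-evaluated on the dataframe every time the pipeline is fitted:

from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([
    ('transf', ColumnTransformer([
        ('scale', StandardScaler(), make_column_selector(dtype_include='number'))
    ], remainder='passthrough')),
    # the last pipeline element must provide feature_importances_
    ('clf', RandomForestClassifier(random_state=0))
])
best_features, best_score, ranks, scores = rfecv(
    test_data.data, test_data.target, pipe,
    step=1, min_features_to_select=1, random_state=0)

Note the caveat from the answer below: a ColumnTransformer can reorder columns, so the importances taken from the last pipeline element follow the transformer's output order. Here all columns go through the same transformer, so the order is preserved.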

Regards,

Matthias

  • Matthias, thanks for directing me towards the solution! You helped me a lot. Could you please show us the definition of the importance_getter callable? I couldn't figure this one out. Please see the code that I came up with, posted below as an answer. Let me know if you have any comments. – Wlodek K Jul 19 '21 at 14:34
  • Hi Wlodek, I'm currently building a production version of the code above (which I found to be still buggy). I will post it as soon as it is finished. – Matthias Jul 19 '21 at 14:54
  • Great! It would be cool to see it. You can take a look at my code. I've corrected some of the bugs in your code. – Wlodek K Jul 19 '21 at 15:05
  • I just updated the post with my state for today. I tested it on a large ML model I am currently developing and it seems to do the job. However, the code is not covered by tests, yet. There may still be bugs to be found ;) For your question regarding importance_getter, I switched to a different method. – Matthias Jul 19 '21 at 15:49
  • Added my final and tested version. – Matthias Jul 20 '21 at 09:46
  • Hi Matthias. Sorry, but it seems your code is not working properly. It does print out nice results while it is running, but when you try to confirm them manually afterwards, they are much worse, even worse than with the initial number of features. I printed X_tmp.columns in the function and their number is different from what is being printed. Why are you using the best estimator's feature importances and not an average over all the folds, like in my code? In my opinion you are over-fitting to the best fold this way. Is there any particular reason why you are appending scores but inserting ranks? – Wlodek K Jul 20 '21 at 15:49
  • You are adding them to the lists in two different orders. – Wlodek K Jul 20 '21 at 15:51
  • Have you had a chance to take a look at or test my code? It is working properly. I tested it. You would just need to adjust the importance getter. – Wlodek K Jul 20 '21 at 15:59
  • Hi Wlodek, I just tested reproduction of the results with the selected features and it worked fine. Are you aware the implementation is using folding with random shuffling? This may lead to different results if you do not reproduce it in your test. You can use the random_state parameter for that. – Matthias Jul 22 '21 at 20:06
  • Also keep in mind that rfecv is a local optimization method. It may get stuck in local optima based on the data input and the randomness in the estimator. – Matthias Jul 22 '21 at 20:08
  • Matthias, I am not using KFold, just plain cv=number of splits, and I could not get your code to work. I checked once again. The scores are not correct in my opinion. There is too much of a difference between the output score and the manual check using cross_validate. I do not have such a big difference with my code. – Wlodek K Jul 23 '21 at 15:39

This is the code I came up with on the basis of Matthias' initial answer (before his edits). Please bear in mind that, since the ColumnTransformer changes the order of the columns, the training data columns must be in the same order as the ColumnTransformer's output order, otherwise this code does not work. I am still trying to figure out how to pass the importance_getter as a function argument (one untested idea is sketched below the code). Kindly let me know if you have a suggestion how to do that.

import numpy as np
import pandas as pd
from sklearn.model_selection import KFold, cross_validate

def df_rfecv(X, y, estimator,
             min_features_to_select=240,
             splits=5,
             scoring="neg_root_mean_squared_error"):

    # Initialize survivors, ranks, scores, indexes
    survivors = list(X.columns)
    ranks = ['None removed']
    scores = []
    indexes = []

    for i in range(len(X.columns), min_features_to_select - 1, -1):
        # Get only the surviving features
        X_tmp = X[survivors]

        # Train and get the scores (cross_validate clones the estimator,
        # so the one passed to this function is not modified)
        cr_val = cross_validate(estimator, X_tmp, y,
                                cv=KFold(n_splits=splits),
                                scoring=scoring, return_estimator=True)
        scores.append(np.mean(cr_val["test_score"]))

        # Average the coefficients over all folds and square them
        # (the importance getter is hard-coded here for now)
        all_coefs = np.array([est.regressor_.named_steps['est'].coef_
                              for est in cr_val['estimator']])
        mean_coefs = np.mean(all_coefs, axis=0)
        weights = np.power(mean_coefs, 2)

        # Find the feature with the smallest weight
        idx = np.argmin(weights)
        indexes.append(i)

        # Update ranks and survivors; also try to remove the deleted
        # feature from the global list of columns transformed by the
        # ColumnTransformer, so that its column spec stays valid
        global prep_trans_cols
        if i < len(X.columns):
            try:
                prep_trans_cols.remove(survivors[idx])
            except ValueError:
                pass
        if i > min_features_to_select:
            ranks.append(survivors.pop(idx))

    # Return ranks and scores in a dataframe whose index indicates
    # the number of features at each step
    return pd.DataFrame({'Feature removed': ranks, 'Scores': scores},
                        index=indexes)
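
One direction I am considering for the importance getter (an untested sketch) is to pass a callable and apply it to each fold's fitted estimator:

def coef_getter(fitted):
    # assumes a TransformedTargetRegressor wrapping a Pipeline
    # whose last step is named 'est'
    return fitted.regressor_.named_steps['est'].coef_

# df_rfecv would then take importance_getter=coef_getter as an argument and use:
#     all_coefs = np.array([importance_getter(est) for est in cr_val['estimator']])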