I want to process a DataFrame using scikit-learn facilities such as Pipeline and ColumnTransformer. My current pipeline looks like this:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

common_pipeline = Pipeline([
    # remove columns which feature high correlation
    ('highCorrRemover', HighCorr_remover())
])
objects_pipeline = Pipeline([
    ('to_numeric', To_numeric())
])
full_pipeline = ColumnTransformer(
    [('all_selected_columns', common_pipeline, selected_columns),
     ('objects', objects_pipeline, object_columns)],
    remainder='drop')
where To_numeric and HighCorr_remover are custom classes. HighCorr_remover is given below; To_numeric follows a similar pattern:
from sklearn.base import BaseEstimator, TransformerMixin

class HighCorr_remover(BaseEstimator, TransformerMixin):
    def __init__(self, corr_threshold=0.7):
        self.corr_threshold = corr_threshold

    def fit(self, X, y=None):
        # stateless: nothing to learn
        return self

    def transform(self, X):
        # split columns into ones to keep and ones to drop, then drop the latter
        corr = X.corr()
        (main_elements, correlated_elements) = remove_high_corr(corr, self.corr_threshold)
        return X.drop(correlated_elements, axis=1)
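For reference, remove_high_corr is a small helper of mine that splits the columns of the correlation matrix into ones to keep and ones to drop. A simplified sketch of the idea (not my exact implementation):

def remove_high_corr(corr, threshold):
    # Simplified sketch: walk the upper triangle of the correlation matrix
    # and mark the second column of each highly correlated pair for dropping.
    correlated = []
    for i, col_i in enumerate(corr.columns):
        for col_j in corr.columns[i + 1:]:
            if col_j not in correlated and abs(corr.loc[col_i, col_j]) > threshold:
                correlated.append(col_j)
    main = [c for c in corr.columns if c not in correlated]
    return main, correlated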
This works as expected, but returns a NumPy array. My question is therefore: with the current version of scikit-learn, what is the proper way of dealing with DataFrames, i.e., getting back a DataFrame with the correct index and columns, taking into account the columns that may have been dropped along the way?
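To make the symptom concrete, here is roughly what I observe (df stands for a hypothetical input DataFrame containing selected_columns and object_columns):

Xt = full_pipeline.fit_transform(df)
print(type(Xt))  # <class 'numpy.ndarray'> -- index and column labels are gone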
I can somehow get around it by implementing a get_feature_names() method in my custom classes and then calling this method explicitly:

full_pipeline.transformers_[0][1].steps[0][1].get_feature_names()
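From there I can reassemble a DataFrame by hand. A sketch, assuming To_numeric also implements get_feature_names() and reusing df and Xt from the snippet above (ColumnTransformer concatenates the transformers' outputs in order):

import pandas as pd

# column names from both transformers, in concatenation order
common_cols = list(full_pipeline.transformers_[0][1].steps[0][1].get_feature_names())
object_cols = list(full_pipeline.transformers_[1][1].steps[0][1].get_feature_names())
Xt_df = pd.DataFrame(Xt, columns=common_cols + object_cols, index=df.index)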
But I feel there may be a better way to do this?