I want to process a DataFrame using scikit-learn facilities such as Pipeline and ColumnTransformer. My current pipeline looks like this:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

common_pipeline = Pipeline([
    # remove columns which feature high correlation
    ('highCorrRemover', HighCorr_remover())
])
objects_pipeline = Pipeline([
    ('to_numeric', To_numeric())
])
full_pipeline = ColumnTransformer(
    [('all_selected_columns', common_pipeline, selected_columns),
     ('objects', objects_pipeline, object_columns)],
    remainder='drop')
where To_numeric and HighCorr_remover are custom classes. HighCorr_remover is given below; To_numeric follows a similar pattern:
from sklearn.base import BaseEstimator, TransformerMixin

class HighCorr_remover(BaseEstimator, TransformerMixin):
    def __init__(self, corr_threshold=0.7):
        self.corr_threshold = corr_threshold

    def fit(self, X, y=None):
        # stateless: nothing to learn
        return self

    def transform(self, X):
        # split columns into ones to keep and ones to drop, then drop the latter
        corr = X.corr()
        (main_elements, correlated_elements) = remove_high_corr(corr, self.corr_threshold)
        return X.drop(correlated_elements, axis=1)
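For reference, remove_high_corr is a small helper of mine that splits the columns of the correlation matrix into ones to keep and ones to drop. A simplified sketch of the idea (not my exact implementation):

def remove_high_corr(corr, threshold):
    # Simplified sketch: walk the upper triangle of the correlation matrix
    # and mark the second column of each highly correlated pair for dropping.
    correlated = []
    for i, col_i in enumerate(corr.columns):
        for col_j in corr.columns[i + 1:]:
            if col_j not in correlated and abs(corr.loc[col_i, col_j]) > threshold:
                correlated.append(col_j)
    main = [c for c in corr.columns if c not in correlated]
    return main, correlated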
This works as expected, but returns a NumPy array. My question is therefore: with the current version of scikit-learn, what is the proper way of dealing with DataFrames, i.e., getting back a DataFrame with the correct index and columns, taking into account the columns that may have been dropped along the way?
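To make the symptom concrete, here is roughly what I observe (df stands for a hypothetical input DataFrame containing selected_columns and object_columns):

Xt = full_pipeline.fit_transform(df)
print(type(Xt))  # <class 'numpy.ndarray'> -- index and column labels are gone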
I can somehow get around it by implementing a get_feature_names() method in my custom classes and then calling this method explicitly:

full_pipeline.transformers_[0][1].steps[0][1].get_feature_names()
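From there I can reassemble a DataFrame by hand. A sketch, assuming To_numeric also implements get_feature_names() and reusing df and Xt from the snippet above (ColumnTransformer concatenates the transformers' outputs in order):

import pandas as pd

# column names from both transformers, in concatenation order
common_cols = list(full_pipeline.transformers_[0][1].steps[0][1].get_feature_names())
object_cols = list(full_pipeline.transformers_[1][1].steps[0][1].get_feature_names())
Xt_df = pd.DataFrame(Xt, columns=common_cols + object_cols, index=df.index)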
But I feel there may be a better way to do this?