While looking up how a step in an sklearn Pipeline can be made to operate on only some columns, I stumbled upon sklearn.pipeline.FeatureUnion in an answer on Stack Overflow. But I couldn't quite figure out how to leave the remaining columns untouched and pass the complete data on to the next step. For example, in my first step I want to apply StandardScaler to only some columns. That can be done with the code shown below, but the problem is that the next step then receives only the columns that were standard-scaled. How do I get the complete data in the next step, with the selected columns scaled by the previous step?
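To make the problem concrete, here is a minimal sketch with made-up toy data (the column names `a`, `b`, `c` are invented for illustration): a FeatureUnion whose only branch selects and scales two columns emits an array containing just those two columns, so any non-selected column is dropped.

```python
import pandas as pd
from sklearn.pipeline import FeatureUnion, make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.base import BaseEstimator, TransformerMixin

class Columns(BaseEstimator, TransformerMixin):
    """Select a subset of DataFrame columns."""
    def __init__(self, names=None):
        self.names = names

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[self.names]

df = pd.DataFrame({"a": [1.0, 2.0, 3.0],
                   "b": [10.0, 20.0, 30.0],
                   "c": ["x", "y", "z"]})

union = FeatureUnion([
    ("numeric", make_pipeline(Columns(names=["a", "b"]), StandardScaler())),
])
out = union.fit_transform(df)
print(out.shape)  # (3, 2) -- column "c" is gone
```

Only the two scaled columns survive; column `c` never reaches the next step.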
Here's some example code:
from sklearn.pipeline import Pipeline, FeatureUnion, make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.ensemble import RandomForestRegressor

class Columns(BaseEstimator, TransformerMixin):
    def __init__(self, names=None):
        self.names = names

    def fit(self, X, y=None, **fit_params):
        return self

    def transform(self, X):
        return X[self.names]
pipe = Pipeline([
    # the step below applies to only some columns
    ("features", FeatureUnion([
        ('numeric', make_pipeline(Columns(names=[list of numeric column names]), StandardScaler())),
    ])),
    ('feature_engineer_step1', FeatEng_1()),
    ('feature_engineer_step2', FeatEng_2()),
    ('feature_engineer_step3', FeatEng_3()),
    ('remove_skew', Skew_Remover()),
    # the step below applies to all columns
    ('model', RandomForestRegressor())
])
EDIT:
Since the chosen answer doesn't include any example code, I am pasting mine here for anyone who encounters this question and expects to find working code. The data used in the example below is the California housing data that ships with Google Colab.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor

# a column transformer that operates on only some columns
num_cols = ['housing_median_age', 'total_rooms', 'total_bedrooms',
            'population', 'households', 'median_income']
p_stand_scaler_1 = ColumnTransformer(
    transformers=[('stand_scale', StandardScaler(), num_cols)],
    # 'passthrough' forwards all unlisted columns untouched to the next step
    remainder='passthrough')

# assemble the pipeline with all the steps
pipe_1 = Pipeline(steps=[('standard_scaler', p_stand_scaler_1),
                         ('rf_regressor', RandomForestRegressor(random_state=100))])

# fit on the training data
pipe_1.fit(house_train.drop('median_house_value', axis=1),
           house_train.loc[:, 'median_house_value'])

# make predictions
pipe_predictions = pipe_1.predict(house_test.drop('median_house_value', axis=1))
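For a self-contained demonstration of the remainder='passthrough' behavior without the housing data (the three columns here are made-up toy data), note that the transformed columns come first in the output, followed by the passthrough columns in their original order:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({"age": [10.0, 20.0, 30.0],
                   "rooms": [1.0, 2.0, 3.0],
                   "ocean": [0.0, 1.0, 0.0]})

ct = ColumnTransformer(
    transformers=[("scale", StandardScaler(), ["age", "rooms"])],
    remainder="passthrough")
out = ct.fit_transform(df)
print(out.shape)   # (3, 3) -- all three columns survive
print(out[:, 2])   # [0. 1. 0.] -- "ocean" passed through unscaled
```

All columns reach the next step, so a downstream estimator sees the full feature set with only the listed columns scaled.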