
While looking up how a step in an sklearn Pipeline can be made to operate on only some columns, I stumbled upon sklearn.pipeline.FeatureUnion in this answer on Stack Overflow. But I couldn't quite figure out how to leave the remaining columns untouched and still pass the complete data to the next step. For example, in my first step I want to apply StandardScaler to only some columns, which can be done with the code shown below; the problem is that the next step will then receive only the columns that were standard-scaled. How do I get the complete data in the next step, with the selected columns standard-scaled by the previous step?

Here's some example code:

from sklearn.pipeline import Pipeline, FeatureUnion, make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.neighbors import KNeighborsClassifier

class Columns(BaseEstimator, TransformerMixin):
    def __init__(self, names=None):
        self.names = names

    def fit(self, X, y=None, **fit_params):
        return self

    def transform(self, X):
        return X[self.names]


pipe = Pipeline([
    # the steps below apply to only some columns
    ("features", FeatureUnion([
        ('numeric', make_pipeline(Columns(names=[list of numeric column names]), StandardScaler())),
    ])),
    ('feature_engineer_step1', FeatEng_1()),
    ('feature_engineer_step2', FeatEng_2()),
    ('feature_engineer_step3', FeatEng_3()),
    ('remove_skew', Skew_Remover()),

    # the step below applies to all columns
    ('model', RandomForestRegressor())
])
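One way to keep the remaining columns with the FeatureUnion approach above is to add a second branch that simply selects them, so that both the scaled and the unscaled columns reach the next step. A minimal sketch, using made-up column names (`age`, `income`, `zipcode`) rather than real ones:

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import FeatureUnion, make_pipeline
from sklearn.preprocessing import StandardScaler

class Columns(BaseEstimator, TransformerMixin):
    """Select a subset of DataFrame columns (same idea as above)."""
    def __init__(self, names=None):
        self.names = names

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[self.names]

# hypothetical column names -- substitute your own
num_cols = ["age", "income"]
other_cols = ["zipcode"]

features = FeatureUnion([
    # branch 1: standard-scale the numeric columns
    ("numeric", make_pipeline(Columns(names=num_cols), StandardScaler())),
    # branch 2: pass the remaining columns through unchanged
    ("rest", Columns(names=other_cols)),
])

X = pd.DataFrame({"age": [20, 30, 40],
                  "income": [1.0, 2.0, 3.0],
                  "zipcode": [10, 20, 30]})
out = features.fit_transform(X)  # all three columns survive
```

One caveat: FeatureUnion concatenates the branch outputs into a NumPy array, so column names are lost, which is why ColumnTransformer (shown in the edit below) is usually the more direct tool for this.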

EDIT:

Since the chosen answer doesn't include any example code, I am pasting mine here for anyone who encounters this question and expects to find code that works. The data used in the example below is the California housing data that comes with Google Colab.

from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor

# writing a column transformer that operates on some columns
num_cols = ['housing_median_age', 'total_rooms','total_bedrooms', 'population', 'households', 'median_income']
p_stand_scaler_1 = ColumnTransformer(transformers=[('stand_scale', StandardScaler(), num_cols)],
                                     # set remainder='passthrough' to pass all unspecified columns, untouched, to the next steps
                                     remainder='passthrough')

# make a pipeline now with all the steps
pipe_1 = Pipeline(steps=[('standard_scaler', p_stand_scaler_1),
                         ('rf_regressor', RandomForestRegressor(random_state=100))])

# pass the data now to fit
pipe_1.fit(house_train.drop('median_house_value', axis=1), house_train.loc[:,'median_house_value'])

# make predictions
pipe_predictions = pipe_1.predict(house_test.drop('median_house_value', axis=1))
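One detail worth knowing when using remainder='passthrough': ColumnTransformer reorders the output, placing the transformed columns first and appending the passthrough columns after them. A small self-contained check with made-up data (not the housing set):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({"a": [1.0, 2.0, 3.0],   # will be scaled
                   "b": [10, 20, 30]})     # passed through untouched

ct = ColumnTransformer(transformers=[("scale", StandardScaler(), ["a"])],
                       remainder="passthrough")
out = ct.fit_transform(df)
# transformed columns come first, remainder columns are appended after them,
# so out[:, 0] is the scaled "a" and out[:, 1] is the original "b"
```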
Vadim Kotov
Naveen Reddy Marthala
  • if feat_eng1() is a general-purpose transform that can be applied to any two columns, like addition or subtraction, then I want to pass a feature name to the transformer and store the result in the data frame under this feature name. Is this possible? – megjosh Dec 05 '20 at 14:20
  • I couldn't understand you completely. You want your transformer to take the complete dataframe as input and apply some transformation function, like addition or subtraction of two columns, in place, yes? If yes, that is possible and I have done it. – Naveen Reddy Marthala Dec 05 '20 at 15:43
  • `def transform(self, X, y=None): for tpl in self.dif_columns: print('diff calculator') X.loc[:, tpl[2]] = X[tpl[0]] - X[tpl[1]] return X` is what I am trying to do. I am passing a tuple ('AMT_PMT', 'AMT_INST', 'AMT_PER') to the constructor; the tuple has the two columns, and the third item is the new column name. I am getting two errors: 1. SettingWithCopy, use loc; 2. keyword cannot be an expression. The second error means the new column name has to be hardcoded and cannot be an expression. Hence, could you share your feat_eng1() structure? – megjosh Dec 05 '20 at 18:09
  • please open a new question and post its url here. I will try to answer. It's just not easy to look at code and understand it this way. – Naveen Reddy Marthala Dec 05 '20 at 18:45
  • Link to the issue I talked in the comment above https://stackoverflow.com/questions/65164203/using-loc-inside-custom-transformer-produces-copy-with-slice-error – megjosh Dec 06 '20 at 02:53
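The pattern described in the comments above (a transformer that writes a difference of two columns into a new column, configured by a tuple) can be sketched like this; copying X inside transform avoids the SettingWithCopy warning, and the new column name comes from the tuple rather than being hardcoded. The column names are taken from the comment and are placeholders:

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class DiffCalculator(BaseEstimator, TransformerMixin):
    """Add a column holding the difference of two existing columns.

    `spec` is a (minuend, subtrahend, new_column_name) tuple, as in
    the ('AMT_PMT', 'AMT_INST', 'AMT_PER') example from the comments.
    """
    def __init__(self, spec):
        self.spec = spec

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        col_a, col_b, new_col = self.spec
        X = X.copy()  # work on a copy to avoid SettingWithCopyWarning
        X[new_col] = X[col_a] - X[col_b]
        return X

df = pd.DataFrame({"AMT_PMT": [100, 200], "AMT_INST": [40, 50]})
result = DiffCalculator(("AMT_PMT", "AMT_INST", "AMT_PER")).transform(df)
```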

2 Answers


You can use a ColumnTransformer from sklearn. Here's a snippet to help you.

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.ensemble import RandomForestClassifier

#transform columns
#num_cols = numerical columns, categorical_col = categorical columns
preprocessor = ColumnTransformer(transformers = [('minmax',MinMaxScaler(), num_cols),
                                                 ('onehot', OneHotEncoder(), categorical_col)])

#model
model = RandomForestClassifier(random_state=0)

#model pipeline
model_pipeline = Pipeline(steps=[('preprocessor', preprocessor), ('model', model)])

model_pipeline.fit(x_train, y_train)

Mohil Patel
  • In the example, you have used min-max scaler on numerical columns and one-hot encoder on categorical columns. What changes would I have to make in that example to only apply the min-max scaler in that step, leave all other columns intact, and pass all of them to the next step? – Naveen Reddy Marthala Oct 09 '20 at 03:52
  • 1
    Just remove the one hot encoder from ColumnTransformer – Mohil Patel Oct 09 '20 at 04:08
  • 1
    I found out that removing it will make the column transformer drop the non-specified columns and only pass the specified ones to the next step. An argument, `remainder='passthrough'`, has to be explicitly set to keep the non-specified columns and include them in a column transformer's output. – Naveen Reddy Marthala Oct 09 '20 at 06:00
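The behaviour described in the last comment can be checked with a tiny sketch (made-up data): without remainder='passthrough', the unspecified columns are silently dropped.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({"num": [1.0, 2.0, 3.0], "extra": [7, 8, 9]})

# remainder defaults to 'drop': unspecified columns are discarded
dropped = ColumnTransformer([("minmax", MinMaxScaler(), ["num"])]).fit_transform(df)

# remainder='passthrough': unspecified columns are kept and appended
kept = ColumnTransformer([("minmax", MinMaxScaler(), ["num"])],
                         remainder="passthrough").fit_transform(df)
```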

I believe that using a ColumnTransformer (from sklearn.compose import ColumnTransformer) should do the trick.

When instantiating your column transformer, you can set remainder='passthrough', which will just leave the remaining columns unaltered.
Then you instantiate a pipeline object with the column transformer as the first step.
That way, the next pipeline step will receive all columns as wanted.
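The steps described above can be sketched as follows; the column names and toy data here are hypothetical, for illustration only:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# hypothetical column names and toy data
num_cols = ["x1", "x2"]
X = pd.DataFrame({"x1": [1.0, 2.0, 3.0, 4.0],
                  "x2": [10.0, 20.0, 30.0, 40.0],
                  "x3": [0, 1, 0, 1]})
y = [1.0, 2.0, 3.0, 4.0]

pipe = Pipeline(steps=[
    # step 1: scale num_cols, leave the remaining columns unaltered
    ("preprocess", ColumnTransformer(
        transformers=[("scale", StandardScaler(), num_cols)],
        remainder="passthrough")),
    # step 2: the model receives all columns, scaled and unscaled
    ("model", RandomForestRegressor(random_state=0)),
])
pipe.fit(X, y)
preds = pipe.predict(X)
```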

nick