How to standardize only numerical columns in pipeline for machine learning?

Question

I have data with numeric and categorical features, and I would like to standardize the numerical features only. The names of the numerical columns are captured in X_num_cols, but I am not sure how to use them in the Pipeline code. For example, make_pipeline(preprocessing.StandardScaler(columns=X_num_cols)) doesn't work, because StandardScaler has no columns argument. I have found a related question on Stack Overflow, but the answers don't fit my code layout or purpose.

from sklearn import preprocessing
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split,GridSearchCV
import pandas as pd
import numpy as np

# Separate target from training features
y = df['MED']
X = df.drop('MED', axis=1)

# Retain only the needed predictors
X = X.filter(['age', 'gender', 'ccis'])

# Find the numerical columns, exclude categorical columns
X_num_cols = X.columns[X.dtypes.apply(lambda c: np.issubdtype(c, np.number))]

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=0.5, 
                                                    random_state=1234, 
                                                    stratify=y)

# Pipeline
pipeline = make_pipeline(preprocessing.StandardScaler(),
            LogisticRegression(penalty='l2'))

# Declare hyperparameters
hyperparameters = {'logisticregression__C' : [0.01, 0.1, 1.0, 10.0, 100.0],
                  'logisticregression__multi_class': ['ovr'],
                  'logisticregression__class_weight': ['balanced']
                  }

# scikit-learn cross-validation with pipeline
clf = GridSearchCV(pipeline, hyperparameters, cv=10)

Sample data is as follows:

Age    Gender    CCIS
13     M         5
24     F         8

Can you add a small sample of your data following the guidelines from [this post](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples)? — DJK, Feb 15 '18 at 19:24
Did you see Marcus V's answer based on FeatureUnion in the link you have referenced? — KRKirov, Feb 15 '18 at 19:54
Yes, but I can't quite make out the code's logic, so I can't implement it. I also tried to mimic the code, but the numeric and categorical lines gave me errors. — KubiK888, Feb 15 '18 at 20:13
@KubiK888 I learned pipelines reading [this](http://zacstewart.com/2014/08/05/pipelines-of-featureunions-of-pipelines.html) post. I think those flowcharts make quite clear how pipeline and feature union can work together and be nested. In fact, I like to draw similar boxes myself if things get complex. — Marcus V., Feb 16 '18 at 08:55
Regarding the numeric and categorical lines: those were taken from the original question. Of course they should be the lists of column names according to your problem, so for instance "X_num_cols" in your case. — Marcus V., Feb 16 '18 at 08:57

score 2 · Answer 1 · answered Mar 12 '18 at 09:28

Your pipeline should be like this:

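from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler, FunctionTransformer
from sklearn.pipeline import Pipeline, FeatureUnion

# cat_indices and num_indices are not defined in this answer; they are assumed
# to be lists of integer column positions (categorical and numeric columns).
# The slicing below needs a NumPy array as input (e.g. X.values), and the
# categorical columns pass through unchanged, so they must already be
# numerically encoded for LogisticRegression.

rg = LogisticRegression(class_weight={0: 1, 1: 10}, random_state=42,
                        solver='saga', max_iter=100, n_jobs=-1,
                        intercept_scaling=1)

pipeline = Pipeline(steps=[
    ('feature_processing', FeatureUnion(transformer_list=[
        # categorical: select the columns and pass them through unscaled
        ('categorical', FunctionTransformer(lambda data: data[:, cat_indices])),

        # numeric: select the columns, then standardize them
        ('numeric', Pipeline(steps=[
            ('select', FunctionTransformer(lambda data: data[:, num_indices])),
            ('scale', StandardScaler())
        ]))
    ])),
    ('clf', rg)
])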
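
The answer above selects columns by integer position on a NumPy array. As an alternative, here is a minimal sketch that selects columns by name directly on the DataFrame with ColumnTransformer, which assumes scikit-learn 0.20 or later (it did not exist when the question was asked). The toy DataFrame and its values are made up to mirror the sample in the question, and the one-hot encoding of gender is my own assumption about how the categorical column should be handled; it is not part of the original answer.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data shaped like the question's sample (values are made up)
df = pd.DataFrame({
    "age": [13, 24, 35, 46, 57, 68],
    "gender": ["M", "F", "M", "F", "M", "F"],
    "ccis": [5, 8, 3, 9, 2, 7],
    "MED": [0, 1, 0, 1, 0, 1],
})
y = df["MED"]
X = df.drop("MED", axis=1)

# Split the column names by dtype
num_cols = X.select_dtypes(include="number").columns.tolist()
cat_cols = X.select_dtypes(exclude="number").columns.tolist()

# Scale only the numeric columns; one-hot encode the categorical ones
# (the encoding step is an assumption, not something the question asked for)
preprocess = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), num_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
    ],
    sparse_threshold=0,  # always return a dense array
)

# The final step is named "logisticregression" so that parameter-grid keys
# such as "logisticregression__C" address it
pipeline = Pipeline(steps=[
    ("preprocess", preprocess),
    ("logisticregression", LogisticRegression(class_weight="balanced")),
])

pipeline.fit(X, y)

# Inspect the transformed features: scaled numeric columns come first,
# followed by the one-hot columns
Xt = pipeline.named_steps["preprocess"].transform(X)
print(Xt.shape)
```

Because the scaler is fitted inside the pipeline, passing this pipeline to GridSearchCV should refit the scaling on each training fold only, and the 'logisticregression__C' key from the question's grid should still match the final step's name.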
