I have data with both numeric and categorical features, and I would like to standardize only the numerical ones. The names of the numerical columns are stored in X_num_cols, but I am not sure how to use them in the Pipeline code. For example, make_pipeline(preprocessing.StandardScaler(columns=X_num_cols)) doesn't work, since StandardScaler does not accept a columns argument. I have found this on stackoverflow, but the answers don't fit my code layout/purpose.
from sklearn import preprocessing
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, GridSearchCV
import pandas as pd
import numpy as np
# Separate target from training features
y = df['MED']
X = df.drop('MED', axis=1)
# Retain only the needed predictors
X = X.filter(['age', 'gender', 'ccis'])
# Find the numerical columns, exclude categorical columns
X_num_cols = X.columns[X.dtypes.apply(lambda c: np.issubdtype(c, np.number))]
# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.5,
                                                    random_state=1234,
                                                    stratify=y)
# Pipeline
pipeline = make_pipeline(preprocessing.StandardScaler(),
                         LogisticRegression(penalty='l2'))
# Declare hyperparameters
hyperparameters = {'logisticregression__C': [0.01, 0.1, 1.0, 10.0, 100.0],
                   'logisticregression__multi_class': ['ovr'],
                   'logisticregression__class_weight': ['balanced']
                   }
# scikit-learn cross-validation with pipeline
clf = GridSearchCV(pipeline, hyperparameters, cv=10)
Sample data is as follows:
Age  Gender  CCIS
13   M       5
24   F       8
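
For reference, one way this is commonly handled is scikit-learn's ColumnTransformer (available from version 0.20), which applies a transformer to a chosen subset of columns and can be used as the first step of the pipeline. The following is only a minimal sketch under stated assumptions: the toy DataFrame is made-up data standing in for df, the one-hot encoding of the categorical column is an assumption about how gender should be handled, and the hyperparameter grid is shortened for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data standing in for df (values are invented for illustration)
rng = np.random.default_rng(0)
n = 60
X = pd.DataFrame({
    "age": rng.integers(10, 80, size=n),
    "gender": np.tile(["M", "F", "F"], n // 3),
    "ccis": rng.integers(0, 10, size=n),
})
y = pd.Series(np.tile([0, 1], n // 2), name="MED")

# Split column names by dtype: numeric columns get scaled, the rest get encoded
num_cols = X.select_dtypes(include="number").columns.tolist()
cat_cols = X.select_dtypes(exclude="number").columns.tolist()

# Scale only the numeric columns; one-hot encode the categorical ones
# (the encoding step is an assumption, LogisticRegression needs numeric input)
preprocess = ColumnTransformer([
    ("num", StandardScaler(), num_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
])

# The ColumnTransformer takes the place of the bare StandardScaler step
pipeline = make_pipeline(preprocess, LogisticRegression(max_iter=1000))

# Step names produced by make_pipeline are the lowercased class names
hyperparameters = {
    "logisticregression__C": [0.1, 1.0, 10.0],
    "logisticregression__class_weight": ["balanced"],
}

clf = GridSearchCV(pipeline, hyperparameters, cv=3)
clf.fit(X, y)
```

Because the scaler sits inside the pipeline, it is refit on the training portion of each cross-validation fold, so the rest of the layout (make_pipeline, the hyperparameter dictionary, GridSearchCV) can stay as it is.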