I am carrying out logistic regression on a mix of categorical and numeric features.
I want to create a pipeline that does some transforms and includes a StandardScaler
on the numeric columns, but not on the categorical columns.
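For reference, assume a setup along these lines (the column names and data here are placeholders, not my real data):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample: two string (categorical) and two numeric columns
df_pivot_sample = spark.createDataFrame(
    [("a", "x", 1.0, 10.0),
     ("b", "y", 2.0, 20.0),
     ("a", "y", 3.0, 30.0)],
    ["cat1", "cat2", "num1", "num2"])

categorical_columns = ["cat1", "cat2"]
numerical_columns = ["num1", "num2"]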
Here is how my pipeline currently looks:
from pyspark.ml import Pipeline
from pyspark.ml.feature import (StringIndexer, OneHotEncoderEstimator,
                                VectorAssembler, StandardScaler)

# Categorical features (which are all strings)
stringindexers = [StringIndexer(inputCol=column, outputCol=column + "_Index")
                  for column in categorical_columns]
onehotencoder_categorical = OneHotEncoderEstimator(
    inputCols=[column + "_Index" for column in categorical_columns],
    outputCols=[column + "_Vec" for column in categorical_columns])
categorical_columns_class_vector = [col + "_Vec" for col in categorical_columns]
categorical_numerical_inputs = categorical_columns_class_vector + numerical_columns

# Assembler for all columns
assembler = VectorAssembler(inputCols=categorical_numerical_inputs,
                            outputCol="features")

pipeline = Pipeline(stages=[*stringindexers,
                            onehotencoder_categorical,
                            assembler,
                            StandardScaler(withStd=True,
                                           withMean=False,
                                           inputCol="features",
                                           outputCol="scaledFeatures")])

pipeline.fit(df_pivot_sample).transform(df_pivot_sample).limit(2).toPandas()
However, this applies the StandardScaler to all of the columns, including the one-hot vectors derived from the categorical features.
How do I structure the above pipeline so that the StandardScaler is only applied to the numeric columns? Do I need to change the order relative to the VectorAssembler?
Should I run the assembler only after I've scaled the numeric columns, and save the scaled columns with a new column suffix?
This answer seems to indicate that this is the right way to go for scikit-learn, but I don't know how to do it with PySpark ML.
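Here is roughly what I'm imagining based on that answer: assemble and scale only the numeric columns first, then assemble the scaled vector together with the untouched one-hot vectors (an untested sketch; "numerical_features" and "scaled_numerical_features" are just intermediate column names I've made up):

numeric_assembler = VectorAssembler(inputCols=numerical_columns,
                                    outputCol="numerical_features")
numeric_scaler = StandardScaler(withStd=True, withMean=False,
                                inputCol="numerical_features",
                                outputCol="scaled_numerical_features")

# Combine the unscaled one-hot vectors with the scaled numeric vector
final_assembler = VectorAssembler(
    inputCols=categorical_columns_class_vector + ["scaled_numerical_features"],
    outputCol="features")

pipeline = Pipeline(stages=[*stringindexers,
                            onehotencoder_categorical,
                            numeric_assembler,
                            numeric_scaler,
                            final_assembler])

Is this the right approach, or is there a cleaner way to do it in PySpark ML?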