
I am carrying out logistic regression on a mix of categorical and numeric features.

I want to create a pipeline that applies some transforms, including a StandardScaler on the numeric columns but not on the categorical columns.

Here is how my pipeline currently looks:

from pyspark.ml import Pipeline
from pyspark.ml.feature import (StringIndexer, OneHotEncoderEstimator,
                                VectorAssembler, StandardScaler)

# Categorical features (which are all strings)
stringindexers = [StringIndexer(
                            inputCol=column,
                            outputCol=column + "_Index") for column in categorical_columns]
onehotencoder_categorical = OneHotEncoderEstimator(
    inputCols = [column + "_Index" for column in categorical_columns],
    outputCols = [column + "_Vec" for column in categorical_columns])

categorical_columns_class_vector = [col + "_Vec" for col in categorical_columns]
categorical_numerical_inputs = categorical_columns_class_vector + numerical_columns

# Assembler for all columns
assembler = VectorAssembler(inputCols = categorical_numerical_inputs, 
                            outputCol="features")

pipeline = Pipeline(
    stages=[*stringindexers,
            onehotencoder_categorical,
            assembler,
            StandardScaler(
                withStd=True,
                withMean=False,
                inputCol="features",
                outputCol="scaledFeatures")
           ]
            )
pipeline.fit(df_pivot_sample).transform(df_pivot_sample).limit(2).toPandas()

However, this applies the StandardScaler to all of the columns, including those derived from the categorical features.

How do I structure the above pipeline so that the StandardScaler is only applied to the numeric columns? Do I need to change the position of the VectorAssembler?

Should I change the assembler so that it only runs after I've scaled the numeric columns, saving the scaled columns under a new suffix?

This answer seems to indicate that this is the right way to go for scikit-learn, but I don't know how to do it with PySpark ML.

Chuck
  • Did you find the solution? – Shubh Feb 28 '21 at 16:11
Convert your categorical variables to one-hot encoded columns. So if you have 5 values for `feature_col_X`, then you end up with `feature_col_X_val_1, feature_col_X_val_2 ... feature_col_X_val_4`. Values of these cols will be 1 or 0. Therefore, you can simply run the standard scaler and it won't make any difference, _I think_. I.e. you can run the standard scaler on your whole pipeline (so it comes at the end of the stage list) – Chuck Mar 04 '21 at 13:16
@Shubh The other way is to run the standard scaler earlier in the list. So run the standard scaler on the numericals, then add in your categoricals and use a VectorAssembler to combine them all into one vector column on which to train your model, so it would be `[numerical_vector_assembler, standard_scaler, stringindexer, onehotencoder, vectorassembler]`. Again, _I think_. For me, in the end I just made everything numeric. Not recommended – Chuck Mar 04 '21 at 13:18
The answer here is good: https://stackoverflow.com/a/52795027/2254228 if you can repurpose it for PySpark as I mention – Chuck Mar 04 '21 at 13:22
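  • The ordering described above can be sketched roughly as follows. This is only an illustration, not a tested answer: the toy DataFrame and column names stand in for `df_pivot_sample`, `categorical_columns`, and `numerical_columns` from the question. Note that `OneHotEncoderEstimator` is the Spark 2.x name; Spark 3.x renamed it to `OneHotEncoder` with the same multi-column API, which the import below accounts for.

    ```python
    from pyspark.sql import SparkSession
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import StringIndexer, VectorAssembler, StandardScaler

    # Spark 2.x calls the multi-column encoder OneHotEncoderEstimator;
    # Spark 3.x renamed it to OneHotEncoder with the same inputCols/outputCols API
    try:
        from pyspark.ml.feature import OneHotEncoderEstimator as OneHotEncoder
    except ImportError:
        from pyspark.ml.feature import OneHotEncoder

    spark = (SparkSession.builder.master("local[1]")
             .appName("scale-numeric-only").getOrCreate())

    # Toy stand-in for df_pivot_sample; the column names are illustrative only
    df = spark.createDataFrame(
        [("a", "x", 1.0, 10.0),
         ("b", "y", 2.0, 20.0),
         ("a", "y", 3.0, 30.0)],
        ["cat1", "cat2", "num1", "num2"])
    categorical_columns = ["cat1", "cat2"]
    numerical_columns = ["num1", "num2"]

    # 1. Assemble only the numeric columns, then scale that vector
    numeric_assembler = VectorAssembler(inputCols=numerical_columns,
                                        outputCol="numerical_vec")
    scaler = StandardScaler(inputCol="numerical_vec",
                            outputCol="numerical_scaled",
                            withStd=True, withMean=False)

    # 2. Index and one-hot encode the categoricals (the scaler never sees them)
    indexers = [StringIndexer(inputCol=c, outputCol=c + "_Index")
                for c in categorical_columns]
    encoder = OneHotEncoder(
        inputCols=[c + "_Index" for c in categorical_columns],
        outputCols=[c + "_Vec" for c in categorical_columns])

    # 3. Final assembler combines scaled numerics with encoded categoricals
    final_assembler = VectorAssembler(
        inputCols=["numerical_scaled"] + [c + "_Vec" for c in categorical_columns],
        outputCol="features")

    pipeline = Pipeline(stages=[numeric_assembler, scaler, *indexers,
                                encoder, final_assembler])
    result = pipeline.fit(df).transform(df)

    # Each k-level category contributes k-1 one-hot slots (dropLast defaults
    # to True), so here "features" = 2 scaled numerics + 1 + 1 one-hot slots
    print(result.select("features").first()["features"].size)
    ```

    Because the StringIndexer/encoder stages only read the raw categorical columns and the scaler only reads `numerical_vec`, the two branches are independent and their relative order in the stage list doesn't matter, as long as both finish before the final assembler.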

0 Answers