I've been trying to convert multiple categorical columns to numeric ones before applying a Spark ML pipeline.
I understand that we can use StringIndexer, OneHotEncoder, and VectorAssembler, which are available in the spark.ml.feature library (pyspark.ml.feature in Python):
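from pyspark.ml import Pipeline
from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler
# Map each category string to a numeric index; rows with unseen labels are skipped
cat1Indexer = StringIndexer(inputCol="CatFeature1", outputCol="indexedCat1", handleInvalid="skip")
# Turn the numeric index into a one-hot vector
cat1Encoder = OneHotEncoder(inputCol="indexedCat1", outputCol="CatVector1")
# Combine the encoded vectors into a single features column
fAssembler = VectorAssembler(inputCols=["CatVector1"], outputCol="features")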
However, I have 130 categorical columns in my data. How can I loop over all the categorical columns to convert them to numeric, instead of listing each column's stages manually as below?
pipeline = Pipeline(stages=[cat1Indexer, cat2Indexer, cat3Indexer,
                            cat1Encoder, cat2Encoder, cat3Encoder,
                            fAssembler])
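A loop-based version might look roughly like the sketch below. It is only an illustration, not a tested solution: the helper names `stage_column_names` and `build_pipeline` and the `_indexed` / `_vec` column suffixes are made up here, and it assumes every column in the list is a string column that StringIndexer can handle. It builds one StringIndexer and one OneHotEncoder per column with list comprehensions, then feeds all encoded vectors to a single VectorAssembler.

```python
def stage_column_names(cat_cols):
    """Map each categorical column to (input, indexed, encoded) column names.

    The "_indexed" and "_vec" suffixes are an arbitrary naming choice.
    """
    return [(c, f"{c}_indexed", f"{c}_vec") for c in cat_cols]


def build_pipeline(cat_cols):
    """Build one indexer and one encoder per column, then assemble them."""
    # Imported inside the function so the naming helper above can be
    # exercised without a Spark installation.
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler

    names = stage_column_names(cat_cols)

    # One StringIndexer per categorical column.
    indexers = [
        StringIndexer(inputCol=src, outputCol=idx, handleInvalid="skip")
        for src, idx, _ in names
    ]
    # One OneHotEncoder per indexed column.
    encoders = [
        OneHotEncoder(inputCol=idx, outputCol=vec) for _, idx, vec in names
    ]
    # A single assembler that combines every encoded vector.
    assembler = VectorAssembler(
        inputCols=[vec for _, _, vec in names], outputCol="features"
    )
    return Pipeline(stages=indexers + encoders + [assembler])
```

Assuming the columns really are named CatFeature1 through CatFeature130 and `df` is the input DataFrame (both are guesses about the data), usage could be `cat_cols = [f"CatFeature{i}" for i in range(1, 131)]` followed by `model = build_pipeline(cat_cols).fit(df)`. On Spark 3.x, StringIndexer and OneHotEncoder also accept `inputCols` / `outputCols`, which may allow a single stage of each type instead of one per column.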