Convert Multiple Categorical Columns to Numeric in PySpark

Asked Dec 27 '16 at 10:36

Active Dec 27 '16 at 14:54

Viewed 508 times

I've been trying to convert multiple categorical columns to numeric before applying a Spark ML pipeline. I understand that we can use StringIndexer, OneHotEncoder and VectorAssembler, which are available in the pyspark.ml.feature module.

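from pyspark.ml import Pipeline
from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler

# Map each category string to a numeric index; rows with unseen labels are skipped
cat1Indexer = StringIndexer(inputCol="CatFeature1", outputCol="indexedCat1", handleInvalid="skip")
# One-hot encode the numeric index into a vector column
cat1Encoder = OneHotEncoder(inputCol="indexedCat1", outputCol="CatVector1")
# Combine the encoded vector(s) into a single features column
fAssembler = VectorAssembler(inputCols=["CatVector1"], outputCol="features")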

However, I have 130 categorical columns in my data. How can I loop over all the categorical columns to convert them to numeric, instead of defining the stages for each column manually as below?

pipeline = Pipeline(stages=[cat1Indexer, cat2Indexer, cat3Indexer,
                        cat1Encoder, cat2Encoder, cat3Encoder,
                        fAssembler])
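
One possible way to avoid writing out each stage by hand is to build the indexers and encoders with list comprehensions and concatenate them into the stage list. The following is only a minimal sketch under some assumptions: the helper names stage_names and build_pipeline, the "indexed_" and "vec_" output prefixes, and the generated column names CatFeature1 to CatFeature130 are all hypothetical and not taken from the actual data. The Spark imports are placed inside the function so that the naming helper can be checked without a Spark installation; calling build_pipeline itself needs an active SparkSession.

```python
def stage_names(col):
    """Return the (indexed, encoded) output column names for one input column."""
    # Hypothetical naming scheme; any unique names would work.
    return "indexed_" + col, "vec_" + col


def build_pipeline(categorical_cols):
    """Build one StringIndexer and one OneHotEncoder per column, plus one assembler."""
    # Imported here so stage_names can be used without Spark installed.
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler

    # One indexer per categorical column
    indexers = [
        StringIndexer(inputCol=c, outputCol=stage_names(c)[0], handleInvalid="skip")
        for c in categorical_cols
    ]
    # One encoder per indexed column
    encoders = [
        OneHotEncoder(inputCol=stage_names(c)[0], outputCol=stage_names(c)[1])
        for c in categorical_cols
    ]
    # A single assembler that collects every encoded vector
    assembler = VectorAssembler(
        inputCols=[stage_names(c)[1] for c in categorical_cols],
        outputCol="features",
    )
    return Pipeline(stages=indexers + encoders + [assembler])


# Hypothetical column names standing in for the 130 real ones. With a real
# DataFrame df, the string columns could instead be picked out with:
#   [c for c, t in df.dtypes if t == "string"]
categorical_cols = ["CatFeature%d" % i for i in range(1, 131)]
```

With a DataFrame df at hand, the result would be used like any other pipeline, for example build_pipeline(categorical_cols).fit(df).transform(df).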

edited Dec 27 '16 at 14:31

zero323

asked Dec 27 '16 at 10:36

Nim J

0 Answers