Create "features" column in PySpark with both numerical and categorical variables

Question

I am trying to create a "features" column in Spark using Python so that it can be used by the machine learning libraries. However, I am having trouble including both numerical and categorical variables in the VectorAssembler that generates the "features" column.

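from pyspark.ml import Pipeline
from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler

cat_cols = ["cat_1", "cat_2", "cat_3"]
num_cols = ["num_1", "num_2", "num_3", "num_4"]

# Map each categorical string column to a numeric category index.
indexers = [StringIndexer(inputCol=c, outputCol="{0}_indexed".format(c)) for c in cat_cols]

# One-hot encode each indexed column.
encoders = [OneHotEncoder(inputCol=indexer.getOutputCol(), outputCol="{0}_encoded".format(indexer.getOutputCol()))
            for indexer in indexers]

# Only the encoded categorical columns are assembled here; num_cols is not used yet.
assembler = VectorAssembler(inputCols=[encoder.getOutputCol() for encoder in encoders], outputCol="features")

pipeline = Pipeline(stages=indexers + encoders + [assembler])
df = pipeline.fit(df).transform(df)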

The pipeline constructed so far creates a "features" column containing only the categorical variables, but I have no idea how to extend it so that the "features" column contains both the categorical and the numerical variables.

Please note that I am using Spark 2.3 along with Python 3.

score 7 · Accepted Answer · answered Apr 16 '18 at 09:13

I have found a way to do it, but I am not sure if this is the most efficient way of achieving what I want.

from pyspark.ml import Pipeline
from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler

cat_cols = ["cat_1", "cat_2", "cat_3"]
num_cols = ["num_1", "num_2", "num_3", "num_4"]

# Categorical branch: index, one-hot encode, then assemble into a "cat" vector.
indexers = [StringIndexer(inputCol=c, outputCol="{0}_indexed".format(c)) for c in cat_cols]

encoders = [OneHotEncoder(inputCol=indexer.getOutputCol(), outputCol="{0}_encoded".format(indexer.getOutputCol()))
            for indexer in indexers]

assemblerCat = VectorAssembler(inputCols=[encoder.getOutputCol() for encoder in encoders], outputCol="cat")

pipelineCat = Pipeline(stages=indexers + encoders + [assemblerCat])
df = pipelineCat.fit(df).transform(df)

# Numerical branch: assemble the raw numeric columns into a "num" vector.
assemblerNum = VectorAssembler(inputCols=num_cols, outputCol="num")

pipelineNum = Pipeline(stages=[assemblerNum])
df = pipelineNum.fit(df).transform(df)

# Merge the two intermediate vectors into the final "features" column.
assembler = VectorAssembler(inputCols=["cat", "num"], outputCol="features")

pipeline = Pipeline(stages=[assembler])
df = pipeline.fit(df).transform(df)

Essentially, I create one pipeline for the categorical variables and one for the numerical variables, then merge their outputs into a single "features" column that contains both.
