2

I am trying to create a "features" column in Spark using Python in order to be used by the Machine Learning libraries. However, I am having issues including both numerical and categorical variables in the VectorAssembler which generates the "features" column.

cat_cols = ["cat_1", "cat_2", "cat_3"]
num_cols = ["num_1", "num_2", "num_3", "num_4"]

indexers = [StringIndexer(inputCol = c, outputCol="{0}_indexed".format(c)) for c in cat_cols]

encoders = [StringIndexer(inputCol = indexer.getOutputCol(), outputCol = "{0}_encoded".format(indexer.getOutputCol())) 
for indexer in indexers]

assembler = VectorAssembler(inputCols = [encoder.getOutputCol() for encoder in encoders], outputCol = "features")

pipeline = Pipeline(stages = indexers + encoders + [assembler])
df = pipeline.fit(df).transform(df)

The pipeline constructed up to now can create a "features" column containing only the categorical variables but I have no idea how to extend it such that the "features" column contains both the categorical and the numerical variables.

Please note that I am using Spark 2.3 along with Python 3.

agrajag_42
  • 155
  • 1
  • 12

1 Answers1

7

I have found a way to do it but I not sure if this is the most efficient way of achieving what I want.

cat_cols = ["cat_1", "cat_2", "cat_3"]
num_cols = ["num_1", "num_2", "num_3", "num_4"]

indexers = [StringIndexer(inputCol = c, outputCol="{0}_indexed".format(c)) for c in cat_cols]

encoders = [StringIndexer(inputCol = indexer.getOutputCol(), outputCol = "{0}_encoded".format(indexer.getOutputCol())) 
for indexer in indexers]

assemblerCat = VectorAssembler(inputCols = [encoder.getOutputCol() for encoder in encoders], outputCol = "cat")

pipelineCat = Pipeline(stages = indexers + encoders + [assemblerCat])
df = pipelineCat.fit(df).transform(df)

assemblerNum = VectorAssembler(inputCols = num_cols, outputCol = "num")

pipelineNum = Pipeline(stages = [assemblerNum])
df = pipelineNum.fit(df).transform(df)

assembler = VectorAssembler(inputCols = ["cat", "num"], outputCol = "features")

pipeline = Pipeline(stages = [assembler])
df = pipeline.fit(df).transform(df)

Essentially I am creating one pipeline for categorical and one for numerical variables and then I am merging them to create a single "features" column which contains both.

agrajag_42
  • 155
  • 1
  • 12