I am trying to create a "features" column in Spark using Python so that it can be used by the machine learning libraries. However, I am having trouble including both numerical and categorical variables in the VectorAssembler that generates the "features" column.
from pyspark.ml import Pipeline
from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler

cat_cols = ["cat_1", "cat_2", "cat_3"]
num_cols = ["num_1", "num_2", "num_3", "num_4"]
# Map each categorical column to a numeric index
indexers = [StringIndexer(inputCol=c, outputCol="{0}_indexed".format(c)) for c in cat_cols]
# One-hot encode each indexed column
encoders = [OneHotEncoder(inputCol=indexer.getOutputCol(), outputCol="{0}_encoded".format(indexer.getOutputCol()))
            for indexer in indexers]
# Assemble only the encoded categorical columns (num_cols is not used yet)
assembler = VectorAssembler(inputCols=[encoder.getOutputCol() for encoder in encoders], outputCol="features")
pipeline = Pipeline(stages=indexers + encoders + [assembler])
df = pipeline.fit(df).transform(df)
The pipeline constructed so far creates a "features" column containing only the categorical variables, but I have no idea how to extend it so that the "features" column contains both the categorical and the numerical variables.
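For reference, one direction that might work is sketched below. It relies on VectorAssembler accepting plain numeric columns alongside vector columns, so the numerical column names could simply be appended to the encoded ones. The names `encoded_cols`, `feature_cols`, and the helper `build_assembler` are hypothetical and only for illustration; the encoded column names are assumed to follow the "{col}_indexed_encoded" scheme from the code above.

```python
cat_cols = ["cat_1", "cat_2", "cat_3"]
num_cols = ["num_1", "num_2", "num_3", "num_4"]

# Output names the indexer -> encoder stages produce under the naming
# scheme used in the question (assumption: "<col>_indexed_encoded").
encoded_cols = ["{0}_indexed_encoded".format(c) for c in cat_cols]

# Encoded categorical columns first, then the raw numerical columns.
feature_cols = encoded_cols + num_cols


def build_assembler(input_cols, output_col="features"):
    """Hypothetical helper: build a VectorAssembler over the given columns."""
    # Imported here so the column-list logic above can be inspected
    # without pyspark being importable.
    from pyspark.ml.feature import VectorAssembler
    return VectorAssembler(inputCols=input_cols, outputCol=output_col)
```

With an active Spark session, `assembler = build_assembler(feature_cols)` would then take the place of the assembler in the pipeline stages, with the indexers and encoders left unchanged.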
Please note that I am using Spark 2.3 along with Python 3.