I have just started using MLlib from Spark. I want to train a simple model (for example, logistic regression). I expected that I would need to tell the model which column to use as the target and which columns to treat as features.
However, it looks like the features have to be in a single column (a column whose values are vectors).
So, my question is: how do I construct such a vector-valued column? I have tried the following, but it does not work:
df = df.withColumn('feat_vec', [df['_c0'], df['_c1'], df['_c2'], df['_c3'], df['_c4']])
ADDED
I have also tried this:
from pyspark.ml.feature import VectorAssembler
assembler = VectorAssembler(inputCols=['_c0', '_c1', '_c2', '_c3', '_c4'], outputCol='feat_vec')
df = assembler.transform(df)
As a result, I get the following error message:
pyspark.sql.utils.IllegalArgumentException: u'Data type StringType is not supported.'