How to combine several columns into one vector-valued column in Spark?

Question

I have just started to use MLlib from Spark. I want to train a simple model (for example logistic regression). My expectation was that I need to "tell" to the model what column to use as target and what columns to treat as features.

However, it looks like there should be just one column with the features (a column containing vectors as values).

So, my question is: How to construct such a vector valued column? I have tried the following (but it does not work):

df = df.withColumn('feat_vec', [df['_c0'], df['_c1'], df['_c1'], df['_c3'], df['_c4']])

ADDED

I have also tried this:

from pyspark.ml.feature import VectorAssembler
assembler = VectorAssembler(inputCols=['_c0', '_c1', '_c2', '_c3', '_c4'], outputCol='feat_vec')
df = assembler.transform(df)

As the result I get the following error message:

pyspark.sql.utils.IllegalArgumentException: u'Data type StringType is not supported.'

I think you've got it all wrong. Take a look [here](https://stackoverflow.com/questions/32982425/encode-and-assemble-multiple-features-in-pyspark). — David Arenburg, Jun 14 '17 at 15:13
Check my answer here re VectorAssembler: https://stackoverflow.com/questions/43355341/spark-pipeline-error/43378263#43378263 — TDrabas, Jun 14 '17 at 16:29
The problem with VectorAssembler is exactly what I pointed to: one (or more) elements or the RDD line are string. You can either precede this with OneHotEncoder or somehow encode the string to numeric. Then I'd suggest putting that into a LabeledPoint if you want to build a supervised model like logistic regression. — TDrabas, Jun 14 '17 at 16:40

Marsellus Wallace · Answer 1 · 2017-06-14T17:13:46.713

3

Using VectorAssembler is the way to go. In a linalg.Vector you can only have Double values. You need to add a StringIndexer + OneHotEncoder in your Pipeline. Then you can use the assembler over the new generated column

E.G. (from link)

from pyspark.ml.feature import OneHotEncoder, StringIndexer

df = spark.createDataFrame([
  (0, "a"),
  (1, "b"),
  (2, "c"),
  (3, "a"),
  (4, "a"),
  (5, "c")
], ["id", "category"])

stringIndexer = StringIndexer(inputCol="category", outputCol="categoryIndex")
model = stringIndexer.fit(df)
indexed = model.transform(df)

encoder = OneHotEncoder(inputCol="categoryIndex", outputCol="categoryVec")
encoded = encoder.transform(indexed)
encoded.show()

P.S. Please take a look at Pipelines

edited Jun 14 '17 at 17:13

answered Jun 14 '17 at 16:09

Marsellus Wallace

17,991
25
90
154

1

from your answer I have learned something useful (basically how to do one hot encoding in Spark) but it does not provide an answer to my question. I do not have categorical features. The features that I have are numeric (although they have be represented as strings). – Roman Jun 14 '17 at 19:59
Maybe I'm misunderstanding the issue. However, if your features are numeric and only have a String type, can't you just cast them to Double before passing them to VectorAssembler? Could you add some sample data to the question? – Marsellus Wallace Jun 14 '17 at 21:11
you are right. This was the reason why the VectorAssembler did not work. First, I did not know that the values are strings. Second, I did not know that they have to be double or float. – Roman Jun 15 '17 at 08:49

How to combine several columns into one vector-valued column in Spark?

1 Answers1