First, let's define two Spark DataFrames, dfString and dfDouble, made up of strings and doubles respectively:
val dfString = sqlContext.createDataFrame(Seq(("a1", "b1", "c1"), ("a2", "b2", "c2"))).toDF("colx", "coly", "colz")
val dfDouble = sqlContext.createDataFrame(Seq((0.0, 1.0, 2.0), (3.0, 4.0, 5.0))).toDF("colx", "coly", "colz")
Second, we prepare a pipeline made up of a single transformer:
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.VectorAssembler

val va = new VectorAssembler().setInputCols(Array("colx", "coly", "colz")).setOutputCol("ft")
val pipeline = new Pipeline().setStages(Array(va))
Fitting this pipeline to dfDouble returns the expected result, where the three columns are concatenated into a single column called ft. But pipeline.fit(dfString) throws:
java.lang.IllegalArgumentException: Data type StringType is not supported.
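To illustrate the contrast, here is a minimal sketch of both calls as run in a Spark 1.6 shell (assuming the dfString, dfDouble, and pipeline definitions above):

```scala
// Works: all input columns are DoubleType, which VectorAssembler accepts.
val model = pipeline.fit(dfDouble)
model.transform(dfDouble).select("ft").show()
// ft holds one vector per row, e.g. [0.0,1.0,2.0] for the first row

// Fails at fit time: VectorAssembler rejects non-numeric columns.
// pipeline.fit(dfString)
// => java.lang.IllegalArgumentException: Data type StringType is not supported.
```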
The question is: how can one obtain the same result for the string columns as for the doubles, while staying within the pipeline framework?
Note that this is not a duplicate of Concatenate columns in apache spark dataframe, since I want to use only transformers that can go into a pipeline, and I do not want to use a StringIndexer transformer.
I am using Spark 1.6.