First, let's define two Spark DataFrames, dfString and dfDouble, made up of strings and doubles respectively:
val dfString = sqlContext.createDataFrame(Seq(("a1", "b1", "c1"), ("a2", "b2", "c2"))).toDF("colx", "coly", "colz")
val dfDouble = sqlContext.createDataFrame(Seq((0.0, 1.0, 2.0), (3.0, 4.0, 5.0))).toDF("colx", "coly", "colz")
Second, we prepare a pipeline made up of a single transformer:
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.VectorAssembler

val va = new VectorAssembler().setInputCols(Array("colx", "coly", "colz")).setOutputCol("ft")
val pipeline = new Pipeline().setStages(Array(va))
Fitting this pipeline to dfDouble returns the expected result, where the three columns are concatenated into a single column called ft. But pipeline.fit(dfString) throws:
java.lang.IllegalArgumentException: Data type StringType is not supported.
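To illustrate the contrast, here is a minimal sketch of both calls as run in a Spark 1.6 shell (assuming the dfString, dfDouble, and pipeline definitions above):

```scala
// Works: all input columns are DoubleType, which VectorAssembler accepts.
val model = pipeline.fit(dfDouble)
model.transform(dfDouble).select("ft").show()
// ft holds one vector per row, e.g. [0.0,1.0,2.0] for the first row

// Fails at fit time: VectorAssembler rejects non-numeric columns.
// pipeline.fit(dfString)
// => java.lang.IllegalArgumentException: Data type StringType is not supported.
```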
The question is: how can one obtain the same result for the string columns as for the doubles, while staying within the pipeline framework?
Note that this is not a duplicate of Concatenate columns in apache spark dataframe, since I want to use only transformers that can go into a pipeline, and I do not want to use a StringIndexer transformer.
I am using Spark 1.6.