
First, let's define two Spark DataFrames, dfString and dfDouble, made up of strings and doubles respectively:

val dfString = sqlContext.createDataFrame(Seq(("a1", "b1", "c1"), ("a2", "b2", "c2"))).toDF("colx", "coly", "colz")
val dfDouble = sqlContext.createDataFrame(Seq((0.0, 1.0, 2.0), (3.0, 4.0, 5.0))).toDF("colx", "coly", "colz")

Second, we prepare a pipeline made up of a single transformer:

val va = new VectorAssembler().setInputCols(Array("colx", "coly", "colz")).setOutputCol("ft")
val pipeline = new Pipeline().setStages(Array(va))

Fitting this pipeline to dfDouble and transforming returns the expected result, where all columns are concatenated into a single vector column called ft. But pipeline.fit(dfString) throws
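For reference, here is what the successful double case looks like (a sketch using the dfDouble and pipeline defined above; the exact show() formatting may vary by Spark version):

```scala
// Fit the pipeline on the numeric DataFrame and apply it
val model = pipeline.fit(dfDouble)
model.transform(dfDouble).select("ft").show()
// ft is a Vector column, e.g. [0.0,1.0,2.0] and [3.0,4.0,5.0]
```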

java.lang.IllegalArgumentException: Data type StringType is not supported.

The question is: how can I obtain the same result with the strings as with the doubles, while staying within the pipeline framework?

Note that this is not a duplicate of Concatenate columns in apache spark dataframe since

  • I want to use only transformers that can go into a pipeline framework,

  • and I do not want to use a StringIndexer transformer.

I am using Spark 1.6.

asked by ranlot
  • _How to obtain the same result_ – it is not possible, since the `Vector` type cannot store strings. If you want to build an array, you'll have to write your own transformer. – zero323 Mar 01 '16 at 13:22
  • I'm surprised since it feels like something quite natural to want to do. Could you please give a hint about how to get started? – ranlot Mar 01 '16 at 13:28
  • Not so much. It is not easy to find a scenario in ML where it would be useful. https://stackoverflow.com/questions/35180527/how-to-create-a-custom-transformer-from-a-udf – zero323 Mar 01 '16 at 13:33
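Following the suggestion in the comments, here is a minimal sketch of a custom transformer that collects the string columns into a single array&lt;string&gt; column instead of a Vector. The class name StringArrayAssembler is hypothetical, and a production version should use proper Param objects rather than plain vars; this is only an illustration against the Spark 1.6 DataFrame-based ml API:

```scala
import org.apache.spark.ml.Transformer
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{array, col}
import org.apache.spark.sql.types.{ArrayType, StringType, StructField, StructType}

// Hypothetical sketch: gathers several string columns into one array<string> column,
// so the result can be used as a pipeline stage where VectorAssembler cannot.
class StringArrayAssembler(override val uid: String) extends Transformer {
  def this() = this(Identifiable.randomUID("strArrAssembler"))

  // Simplified configuration; a real transformer would declare Params here
  var inputCols: Array[String] = Array()
  var outputCol: String = "ft"
  def setInputCols(v: Array[String]): this.type = { inputCols = v; this }
  def setOutputCol(v: String): this.type = { outputCol = v; this }

  // Build the output column with the SQL array() function over the input columns
  override def transform(df: DataFrame): DataFrame =
    df.withColumn(outputCol, array(inputCols.map(col): _*))

  // Append the new array<string> field to the schema
  override def transformSchema(schema: StructType): StructType =
    StructType(schema.fields :+ StructField(outputCol, ArrayType(StringType), nullable = false))

  override def copy(extra: ParamMap): StringArrayAssembler = defaultCopy(extra)
}
```

Such a stage could then replace the VectorAssembler for the string case, e.g. new Pipeline().setStages(Array(new StringArrayAssembler().setInputCols(Array("colx", "coly", "colz")).setOutputCol("ft"))), assuming downstream stages accept an array column rather than a Vector.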

0 Answers