I have the following code, and I am trying to set the label using a StringIndexer and the features using a VectorAssembler:

    StructType schema = createStructType(new StructField[]{
        createStructField("id", IntegerType, false),
        createStructField("country", StringType, false),
        createStructField("hour", IntegerType, false),
        createStructField("clicked", DoubleType, false)
    });

    List<Row> data = Arrays.asList(
        RowFactory.create(7, "US", 18, 1.0),
        RowFactory.create(8, "CA", 12, 0.0),
        RowFactory.create(9, "NZ", 15, 0.0)
    );

    Dataset<Row> dataset = sparkSession.createDataFrame(data, schema);

    StringIndexer indexer = new StringIndexer()
        .setInputCol("clicked")
        .setOutputCol("label");
    Dataset<Row> ds = indexer.fit(dataset).transform(dataset);

    VectorAssembler assembler = new VectorAssembler()
        .setInputCols(new String[]{"id", "country", "hour"})
        .setOutputCol("features");
    Dataset<Row> finalDS = assembler.transform(ds);

    LogisticRegression lr = new LogisticRegression()
        .setMaxIter(10)
        .setRegParam(0.3)
        .setElasticNetParam(0.8);

    // Fit the model
    LogisticRegressionModel lrModel = lr.fit(finalDS);
    Dataset<Row> output = lrModel.transform(finalDS);
    output.select("features", "label").show();

When I submit it to Spark, I get the following error message:

 7/04/27 22:34:24 INFO DAGScheduler: Job 0 finished: countByValue at StringIndexer.scala:92, took 1.003742 s
Exception in thread "main" java.lang.IllegalArgumentException: Data type StringType is not supported.
    at org.apache.spark.ml.feature.VectorAssembler$$anonfun$transformSchema$1.apply(VectorAssembler.scala:121)
    at org.apache.spark.ml.feature.VectorAssembler$$anonfun$transformSchema$1.apply(VectorAssembler.scala:117)
    at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
    at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
    at org.apache.spark.ml.feature.VectorAssembler.transformSchema(VectorAssembler.scala:117)
    at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:74)
    at org.apache.spark.ml.feature.VectorAssembler.transform(VectorAssembler.scala:54)
Bhagwati Malav

1 Answer

VectorAssembler accepts only three types of columns:

DoubleType - double scalar, optionally with column metadata.

NumericType - arbitrary numeric.

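VectorUDT - vector column.

Your "country" column is StringType, which is none of these, so VectorAssembler rejects it. Convert it to a numeric column first (for example with a StringIndexer) and pass the indexed column to the assembler instead.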
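As a rough sketch of one way to get past the error, the string column can be indexed before assembling. This reuses the schema and rows from the question; the class name, the `countryIndex` column name and the local master setting are my own assumptions for illustration, and it needs Spark SQL and MLlib on the classpath.

```java
import java.util.Arrays;
import java.util.List;

import org.apache.spark.ml.feature.StringIndexer;
import org.apache.spark.ml.feature.VectorAssembler;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

import static org.apache.spark.sql.types.DataTypes.*;

// Hypothetical class name, used only for this illustration.
public class IndexCountryBeforeAssembling {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("IndexCountryBeforeAssembling")
                .master("local[*]") // assumption: local run for illustration
                .getOrCreate();

        // Same schema and rows as in the question.
        StructType schema = createStructType(new StructField[]{
                createStructField("id", IntegerType, false),
                createStructField("country", StringType, false),
                createStructField("hour", IntegerType, false),
                createStructField("clicked", DoubleType, false)
        });
        List<Row> data = Arrays.asList(
                RowFactory.create(7, "US", 18, 1.0),
                RowFactory.create(8, "CA", 12, 0.0),
                RowFactory.create(9, "NZ", 15, 0.0)
        );
        Dataset<Row> dataset = spark.createDataFrame(data, schema);

        // Map each country string to a numeric index in a new column.
        // "countryIndex" is an assumed name, not something from the question.
        StringIndexer countryIndexer = new StringIndexer()
                .setInputCol("country")
                .setOutputCol("countryIndex");
        Dataset<Row> indexed = countryIndexer.fit(dataset).transform(dataset);

        // Assemble only numeric columns: the indexed country replaces the raw string.
        VectorAssembler assembler = new VectorAssembler()
                .setInputCols(new String[]{"id", "countryIndex", "hour"})
                .setOutputCol("features");
        Dataset<Row> assembled = assembler.transform(indexed);

        assembled.select("features", "clicked").show();
        spark.stop();
    }
}
```

Note that the index produced by StringIndexer is just a category id, so for a linear model such as LogisticRegression you may also want to one-hot encode `countryIndex` before assembling.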

For more, see:

  1. Formatting data for spark ML
  2. How to create correct data frame for classification in Spark ML
Yash P Shah