I am trying to create a Spark ML Pipeline with the Random Forest Classifier to perform classification (not regression), but I am getting an error saying the label column in my training set should be a double instead of an integer. I am following instructions from these pages:

I have a Spark dataframe with following columns:

scala> df.show(5)
+-------+----------+----------+---------+-----+
| userId|duration60|duration30|duration1|label|
+-------+----------+----------+---------+-----+
|user000|        11|        21|       35|    3|
|user001|        28|        41|       28|    4|
|user002|        17|         6|        8|    2|
|user003|        39|        29|        0|    1|
|user004|        26|        23|       25|    3|
+-------+----------+----------+---------+-----+


scala> df.printSchema()
root
 |-- userId: string (nullable = true)
 |-- duration60: integer (nullable = true)
 |-- duration30: integer (nullable = true)
 |-- duration1: integer (nullable = true)
 |-- label: integer (nullable = true)

I am using the feature columns duration60, duration30, and duration1 to predict the categorical column label.

I then set up my Spark script like so:

import org.apache.log4j.Logger
import org.apache.log4j.Level
import org.apache.spark.sql.SQLContext
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.classification.{RandomForestClassificationModel, RandomForestClassifier}
import org.apache.spark.ml.{Pipeline, PipelineModel}


Logger.getLogger("org").setLevel(Level.ERROR)
Logger.getLogger("akka").setLevel(Level.ERROR)

val sqlContext = new SQLContext(sc)
val df = sqlContext.read.
    format("com.databricks.spark.csv").
    option("header", "true"). // Use first line of all files as header
    option("inferSchema", "true"). // Automatically infer data types
    load("/tmp/features.csv").
    withColumnRenamed("satisfaction", "label").
    select("userId", "duration60", "duration30", "duration1", "label")

val assembler = new VectorAssembler().
    setInputCols(Array("duration60", "duration30", "duration1")).
    setOutputCol("features")


val randomForest = new RandomForestClassifier().
    setLabelCol("label").
    setFeaturesCol("features").
    setNumTrees(10)

val pipeline = new Pipeline().setStages(Array(assembler, randomForest))

val model = pipeline.fit(df)

The transformed dataframe is the following:

scala> assembler.transform(df).show(5)
+-------+----------+----------+---------+-----+----------------+
| userId|duration60|duration30|duration1|label|        features|
+-------+----------+----------+---------+-----+----------------+
|user000|        11|        21|       35|    3|[11.0,21.0,35.0]|
|user001|        28|        41|       28|    4|[28.0,41.0,28.0]|
|user002|        17|         6|        8|    2|  [17.0,6.0,8.0]|
|user003|        39|        29|        0|    1| [39.0,29.0,0.0]|
|user004|        26|        23|       25|    3|[26.0,23.0,25.0]|
+-------+----------+----------+---------+-----+----------------+

However, the last line throws an exception:

java.lang.IllegalArgumentException: requirement failed: Column label must be of type DoubleType but was actually IntegerType.

What does this mean, and how do I fix it?

Why does the label column need to be a double? I am doing classification, not regression, so I thought a string or an integer would be appropriate. A double value for the predicted column usually implies regression.


3 Answers

Cast the label column to DoubleType, since that is the type the algorithm expects:

import org.apache.spark.sql.types._
df.withColumn("label", 'label cast DoubleType)
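
You can sanity-check the cast with printSchema; the label column should now come back as a double (this output is sketched from the schema shown in the question):

scala> df.withColumn("label", 'label cast DoubleType).printSchema()
root
 |-- userId: string (nullable = true)
 |-- duration60: integer (nullable = true)
 |-- duration30: integer (nullable = true)
 |-- duration1: integer (nullable = true)
 |-- label: double (nullable = true)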

So, in your application, where you define val df, do the cast as the last step in the chain:

import org.apache.spark.sql.types._
val df = sqlContext.read.
    format("com.databricks.spark.csv").
    option("header", "true"). // Use first line of all files as header
    option("inferSchema", "true"). // Automatically infer data types
    load("/tmp/features.csv").
    withColumnRenamed("satisfaction", "label").
    select("userId", "duration60", "duration30", "duration1", "label")
    .withColumn("label", 'label cast DoubleType) // <-- HERE

Note that I've used the 'label symbol (a single quote ' followed by the column name) to reference the label column (I might also have done that using $"label", col("label"), df("label"), or column("label")).
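
For the record, all of those forms build the same Column expression. A quick sketch (assuming the spark-shell implicits that enable ' and $ are in scope, as they are by default):

import org.apache.spark.sql.types.DoubleType
import org.apache.spark.sql.functions.{col, column}

// All of these produce the same cast; pick whichever style you prefer:
df.withColumn("label", 'label cast DoubleType)
df.withColumn("label", $"label".cast(DoubleType))
df.withColumn("label", col("label").cast(DoubleType))
df.withColumn("label", df("label").cast(DoubleType))
df.withColumn("label", column("label").cast(DoubleType))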

  • Thanks. Now I'm getting an error: `RandomForestClassifier was given input with invalid label column label, without the number of classes specified. See StringIndexer.` What does that mean? And why should the label be a double if I'm doing classification? Usually continuous dependent variables are used for regression, not classification. – stackoverflowuser2010 Apr 14 '16 at 04:23
  • @stackoverflowuser2010 Manually cast columns lack the metadata required for ML estimators to work. You'll have to add this manually. See for example http://stackoverflow.com/q/36517302/1560062 – zero323 Apr 27 '16 at 19:31
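
Following up on zero323's comment, here is a minimal sketch of the StringIndexer route (reusing assembler and df from the question; the indexedLabel column name is just an illustrative choice). Indexing the label attaches ML metadata, including the number of classes, which is what the classifier was complaining about:

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.ml.classification.RandomForestClassifier

// Index the raw label; the output column carries the class metadata
// that RandomForestClassifier needs.
val labelIndexer = new StringIndexer().
    setInputCol("label").
    setOutputCol("indexedLabel")

val randomForest = new RandomForestClassifier().
    setLabelCol("indexedLabel").
    setFeaturesCol("features").
    setNumTrees(10)

// The indexer stage runs before the classifier.
val pipeline = new Pipeline().setStages(Array(assembler, labelIndexer, randomForest))
val model = pipeline.fit(df)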

In PySpark:

from pyspark.sql.types import DoubleType
df = df.withColumn("label", df.label.cast(DoubleType()))

In case you are using PySpark and facing the same issue:

from pyspark.ml.feature import StringIndexer

stringIndexer = StringIndexer(inputCol="label", outputCol="newlabel")
model = stringIndexer.fit(df)
df = model.transform(df)
df.printSchema()

This is one way to get the label column into 'double' type.

  • I have downvoted because your answer is misleading (`other methods don't seem to work.`). The other answers do work when you have numeric labels! – eliasah Jan 19 '18 at 07:55
  • I'm sorry my bad, edited it accordingly. Thanks for pointing it out – Vamsi Krishna Jan 22 '18 at 05:13
  • This is still not accurate. What you might want to point out is that in case you have non-numeric labels, StringIndexer *converts* them into the desired format and that’s not a *cast* – eliasah Jan 22 '18 at 06:14