I have a problem with ml.CrossValidator in Scala Spark when using OneHotEncoder.

This is my code:

val tokenizer = new Tokenizer().
                    setInputCol("subjects").
                    setOutputCol("subject")

//CountVectorizer / TF
val countVectorizer = new CountVectorizer().
                        setInputCol("subject").
                        setOutputCol("features")

// convert string into numerical values
val labelIndexer = new StringIndexer().
                        setInputCol("labelss").
                        setOutputCol("labelsss")

// convert numerical to one hot encoder
val labelEncoder = new OneHotEncoder().
                   setInputCol("labelsss").
                   setOutputCol("label")

val logisticRegression = new LogisticRegression()

val pipeline = new Pipeline().setStages(Array(tokenizer,countVectorizer,labelIndexer,labelEncoder,logisticRegression))

It gives me an error like this:

cv: org.apache.spark.ml.tuning.CrossValidator = cv_8cc1ae985e39
java.lang.IllegalArgumentException: requirement failed: Column label must be of type NumericType but was actually of type org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7.

I have no idea how to fix it.

I need the one-hot encoder because my label is categorical.

Thanks for helping me :)

eliasah

1 Answer

There is no need to use OneHotEncoder/OneHotEncoderEstimator for labels (target variables), and you shouldn't. The encoder turns the label into a vector (type org.apache.spark.ml.linalg.VectorUDT), while LogisticRegression requires a numeric label column, which is exactly what the error message reports.

StringIndexer is enough to mark your labels as categorical.

Let's check that with a small example:

val df = Seq((0, "a"),(1, "b"),(2, "c"),(3, "a"),(4, "a"),(5, "c")).toDF("category", "text")
// df: org.apache.spark.sql.DataFrame = [category: int, text: string]

val indexer = new StringIndexer().setInputCol("category").setOutputCol("categoryIndex").fit(df)
// indexer: org.apache.spark.ml.feature.StringIndexerModel = strIdx_cf691c087e1d

val indexed = indexer.transform(df)
// indexed: org.apache.spark.sql.DataFrame = [category: int, text: string ... 1 more field]

indexed.schema.map(_.metadata).foreach(println)
// {}
// {}
// {"ml_attr":{"vals":["4","5","1","0","2","3"],"type":"nominal","name":"categoryIndex"}}

As you can see, StringIndexer attaches metadata to the output column (categoryIndex) and marks it as nominal, i.e. categorical.

The column's attributes also contain the list of categories.

More on this in my other answer about How to handle categorical features with spark-ml?
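
Applied to the pipeline from the question, the change could look like the sketch below. It assumes the column names from the question ("subjects", "labelss"), drops the OneHotEncoder stage, and lets StringIndexer write directly into "label", which is the default labelCol of LogisticRegression. The imports are the ones I would expect the question's snippet to need; treat this as an illustration, not a tested drop-in.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{CountVectorizer, StringIndexer, Tokenizer}

// Split the raw text column into tokens
val tokenizer = new Tokenizer()
  .setInputCol("subjects")
  .setOutputCol("subject")

// Turn the tokens into term-count feature vectors
val countVectorizer = new CountVectorizer()
  .setInputCol("subject")
  .setOutputCol("features")

// Index the categorical label straight into "label" (a numeric column);
// no one-hot encoding of the label is applied
val labelIndexer = new StringIndexer()
  .setInputCol("labelss")
  .setOutputCol("label")

// Uses the default columns: featuresCol = "features", labelCol = "label"
val logisticRegression = new LogisticRegression()

val pipeline = new Pipeline()
  .setStages(Array(tokenizer, countVectorizer, labelIndexer, logisticRegression))
```

The resulting pipeline can then be passed to CrossValidator via setEstimator as before.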

Concerning data preparation and metadata with spark-ml, I strongly advise you to read the following entry:

https://github.com/awesome-spark/spark-gotchas/blob/5ad4c399ffd2821875f608be8aff9f1338478444/06_data_preparation.md

Disclaimer: I'm the co-author of the entry in the link.

Note: (excerpt from the doc)

Because this existing OneHotEncoder is a stateless transformer, it is not usable on new data where the number of categories may differ from the training data. In order to fix this, a new OneHotEncoderEstimator was created that produces an OneHotEncoderModel when fitting. For more detail, please see SPARK-13030.

OneHotEncoder has been deprecated in 2.3.0 and will be removed in 3.0.0. Please use OneHotEncoderEstimator instead.
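
For features (not labels), a minimal sketch of the estimator-based encoder might look like this. It assumes Spark 2.3 or 2.4, where the class is named OneHotEncoderEstimator (in 3.0 it was renamed to OneHotEncoder); the DataFrame, the local SparkSession, and the column names "text", "textIndex" and "textVec" are made up for illustration.

```scala
import org.apache.spark.ml.feature.{OneHotEncoderEstimator, StringIndexer}
import org.apache.spark.sql.SparkSession

// Local session for the sketch; in spark-shell `spark` already exists
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("ohe-sketch")
  .getOrCreate()
import spark.implicits._

// Hypothetical training data with one categorical feature column
val train = Seq("a", "b", "c", "a").toDF("text")

// Step 1: map the strings to numeric category indices
val indexer = new StringIndexer()
  .setInputCol("text")
  .setOutputCol("textIndex")
  .fit(train)
val indexed = indexer.transform(train)

// Step 2: fit the encoder; the fitted OneHotEncoderModel keeps the
// category sizes seen during training and reuses them on new data
val encoder = new OneHotEncoderEstimator()
  .setInputCols(Array("textIndex"))
  .setOutputCols(Array("textVec"))
  .fit(indexed)

encoder.transform(indexed).show()
```

Because the encoder is fitted, the same model can be applied to new data without the vector size depending on which categories happen to appear there.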

eliasah
  • Thanks for the help. Yes, StringIndexer is enough. But my professor says that using numerical values for categorical data still has disadvantages: a, b, c = 0, 1, 2 implies c > b, so we need a one-hot encoder. But they said that while teaching MLP. Does LogReg need OneHotEncoder too? –  May 31 '18 at 05:28
  • I don’t believe he said that for labels. Logistic regression can benefit sometimes from OHE, and sometimes it’s not actually needed. Feature engineering depends on the learning task, nature of the data and most importantly model performance. OHE doesn’t behave the same way with RF as in LR. @AliHelmutBaltschun – eliasah May 31 '18 at 05:50