
I've been trying to understand why the Spark ML OneHotEncoder transformation throws an empty-string error even though, as far as I can see, no empty strings are being passed to it.

To reproduce, I have the following sample DataFrame:

val df = sparkSession.createDataFrame(Seq(
  (0, "apple"),
  (1, "banana"),
  (2, ""),
  (1, "banana"),
  (2, null)
)).toDF("id", "fruit")

I want to use the fruit column in some ML algorithms, so I need to encode it as a vector. To do so, I first run a StringIndexer transformation and then run OneHotEncoder on the output of that indexing, with both stages wrapped in a pipeline:

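import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}

val indexer = new StringIndexer()
  .setInputCol("fruit")
  .setOutputCol("fruit_category")
  .setHandleInvalid("keep")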

val encoder = new OneHotEncoder()
  .setInputCol("fruit_category")
  .setOutputCol("fruit_vec")

// create pipeline
val transform_pipeline = new Pipeline().setStages(Array(indexer, encoder))

// run pipeline on DF to create model
val index_model = transform_pipeline.fit(df)

// use the model to actually transform a DF
val df2 = index_model.transform(df)

However, when I do this, I get:

Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: Cannot have an empty string for name.

However, the column I am passing to OneHotEncoder is the output of the StringIndexer ("fruit_category"), which has NO BLANK STRINGS, i.e.:

+---+------+--------------+
| id| fruit|fruit_category|
+---+------+--------------+
|  0| apple|           2.0|
|  1|banana|           0.0|
|  2|      |           1.0|
|  1|banana|           0.0|
|  2|  null|           3.0|
+---+------+--------------+

What is going on here? Is OneHotEncoder somehow using the original labels retained by the StringIndexer? I thought I didn't need to remove any blank strings, since the indexer had already indexed those as doubles.
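
One way I could check that hunch is to look at the ML attribute metadata that StringIndexer attaches to its output column, rather than only at the numeric values. The following is a minimal sketch, not a confirmed diagnosis: it assumes a local SparkSession (the `spark` value, the `local[1]` master and the app name are my own placeholders) and is meant to be pasted into spark-shell or run as a script. If the empty string shows up among the labels stored on "fruit_category", that would suggest the labels are still travelling with the column even though the values themselves are doubles.

```scala
import org.apache.spark.ml.attribute.{Attribute, NominalAttribute}
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.sql.SparkSession

// Assumption: a throwaway local session, used only for this check.
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("indexer-metadata-check")
  .getOrCreate()

// Same sample data as in the question.
val df = spark.createDataFrame(Seq(
  (0, "apple"),
  (1, "banana"),
  (2, ""),
  (1, "banana"),
  (2, null)
)).toDF("id", "fruit")

// Run only the indexing stage, with the same settings as above.
val indexed = new StringIndexer()
  .setInputCol("fruit")
  .setOutputCol("fruit_category")
  .setHandleInvalid("keep")
  .fit(df)
  .transform(df)

// Read the attribute metadata stored on the indexed column's schema field.
val labels: Seq[String] =
  Attribute.fromStructField(indexed.schema("fruit_category")) match {
    case nominal: NominalAttribute => nominal.values.map(_.toSeq).getOrElse(Seq.empty)
    case _                         => Seq.empty
  }

// Print each stored label next to its index; quotes make a blank label visible.
labels.zipWithIndex.foreach { case (label, i) => println(s"$i -> '$label'") }

// True if a blank label is stored in the column metadata.
val hasEmptyLabel = labels.contains("")
```

Checking `hasEmptyLabel` (or just reading the printed labels) would show whether the blank string survives in the column metadata that a downstream stage could read.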

renegademonkey