I've been trying to understand why the Spark ML OneHotEncoder transformation throws an empty-string error when, as far as I can see, no empty strings are being passed to it.
For reproducibility, I have the following sample df:
val df = sparkSession.createDataFrame(Seq(
  (0, "apple"),
  (1, "banana"),
  (2, ""),
  (1, "banana"),
  (2, null)
)).toDF("id", "fruit")
Now, I wish to use the fruit column in some ML algorithms, so I want to encode it as a vector. To do so, I first run a StringIndexer transformation and then run OneHotEncoder on the output of that indexing, running the whole thing through a pipeline:
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}

val indexer = new StringIndexer()
  .setInputCol("fruit")
  .setOutputCol("fruit_category")
  .setHandleInvalid("keep")

val encoder = new OneHotEncoder()
  .setInputCol("fruit_category")
  .setOutputCol("fruit_vec")

// create pipeline
val transform_pipeline = new Pipeline().setStages(Array(indexer, encoder))
// fit the pipeline on the DF to create a model
val index_model = transform_pipeline.fit(df)
// use the model to actually transform a DF
val df2 = index_model.transform(df)
However, when I do this, I get:
Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: Cannot have an empty string for name.
But the column I am passing to OneHotEncoder is the output of the StringIndexer ("fruit_category"), which has NO blank strings, i.e.:
+---+------+--------------+
| id| fruit|fruit_category|
+---+------+--------------+
| 0| apple| 2.0|
| 1|banana| 0.0|
| 2| | 1.0|
| 1|banana| 0.0|
| 2| null| 3.0|
+---+------+--------------+
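(For reference, the table above is the output of the StringIndexer stage run on its own; a minimal sketch of how it can be reproduced, reusing the indexer defined above:)
// run only the StringIndexer stage to inspect its output column
indexer.fit(df).transform(df).show()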
What is going on here? Is OneHotEncoder somehow using the original labels retained by the StringIndexer? I thought I didn't need to remove any blank strings, since the indexer maps them to doubles anyway.
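In case it helps, here is a minimal sketch of how the indexer's retained labels and the column metadata could be inspected (assuming Spark 2.x, where StringIndexerModel exposes a labels array and the indexed column carries a nominal attribute in its metadata):
import org.apache.spark.ml.attribute.Attribute
import org.apache.spark.ml.feature.StringIndexerModel

// pull the fitted indexer back out of the pipeline model and print the labels it retained
val fittedIndexer = index_model.stages(0).asInstanceOf[StringIndexerModel]
println(fittedIndexer.labels.mkString("[", ", ", "]"))

// the indexed column also carries these labels as a nominal attribute in its metadata,
// which is what downstream stages like OneHotEncoder read
val indexed = fittedIndexer.transform(df)
println(Attribute.fromStructField(indexed.schema("fruit_category")))
If the empty string shows up among those labels or attribute values, that would explain where OneHotEncoder is picking it up, even though the column itself only contains doubles.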