I've been trying to understand why the Spark ML OneHotEncoder transformation throws an empty-string error when, as far as I can see, no empty strings are being passed to it.
For reproducibility, I have the following sample df:
val df = sparkSession.createDataFrame(Seq(
  (0, "apple"),
  (1, "banana"),
  (2, ""),
  (1, "banana"),
  (2, null)
)).toDF("id", "fruit")
Now, I wish to use the fruit column in some ML algorithms, so I want to encode it as a vector. To do so, I first run a StringIndexer transformation and then run OneHotEncoder on the output of that indexing, running the whole thing through a pipeline:
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}

val indexer = new StringIndexer()
  .setInputCol("fruit")
  .setOutputCol("fruit_category")
  .setHandleInvalid("keep")

val encoder = new OneHotEncoder()
  .setInputCol("fruit_category")
  .setOutputCol("fruit_vec")

// create pipeline
val transform_pipeline = new Pipeline().setStages(Array(indexer, encoder))
// fit the pipeline on the DF to create a model
val index_model = transform_pipeline.fit(df)
// use the model to actually transform a DF
val df2 = index_model.transform(df)
However, when I do this, I get:
Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: Cannot have an empty string for name.
But the column I am passing to OneHotEncoder is the output of the StringIndexer ("fruit_category"), which has NO blank strings, i.e.:
+---+------+--------------+
| id| fruit|fruit_category|
+---+------+--------------+
| 0| apple| 2.0|
| 1|banana| 0.0|
| 2| | 1.0|
| 1|banana| 0.0|
| 2| null| 3.0|
+---+------+--------------+
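(For reference, the table above is the output of the StringIndexer stage run on its own; a minimal sketch of how it can be reproduced, reusing the indexer defined above:)
// run only the StringIndexer stage to inspect its output column
indexer.fit(df).transform(df).show()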
What is going on here? Is OneHotEncoder somehow using the original labels retained by the StringIndexer? I thought I didn't need to remove any blank strings, since the indexer maps them to doubles anyway.
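In case it helps, here is a minimal sketch of how the indexer's retained labels and the column metadata could be inspected (assuming Spark 2.x, where StringIndexerModel exposes a labels array and the indexed column carries a nominal attribute in its metadata):
import org.apache.spark.ml.attribute.Attribute
import org.apache.spark.ml.feature.StringIndexerModel

// pull the fitted indexer back out of the pipeline model and print the labels it retained
val fittedIndexer = index_model.stages(0).asInstanceOf[StringIndexerModel]
println(fittedIndexer.labels.mkString("[", ", ", "]"))

// the indexed column also carries these labels as a nominal attribute in its metadata,
// which is what downstream stages like OneHotEncoder read
val indexed = fittedIndexer.transform(df)
println(Attribute.fromStructField(indexed.schema("fruit_category")))
If the empty string shows up among those labels or attribute values, that would explain where OneHotEncoder is picking it up, even though the column itself only contains doubles.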