I'm using the RandomForestClassifier from Spark 2.2 ML to make some predictions.
The result looks like this:
+-----+----------------------------------------+----+----------+
|label|features                                |prob|prediction|
+-----+----------------------------------------+----+----------+
|0.0  |(80,[0,4,9,11,16],[1.0,1.0,1.0,1.0,1.0])|... |...       |
|0.0  |(80,[0,4,9,11,16],[1.0,1.0,1.0,1.0,1.0])|... |...       |
|0.0  |(80,[1,5,7,11,16],[1.0,1.0,1.0,1.0,1.0])|... |...       |
|0.0  |(80,[1,6,7,12,16],[1.0,1.0,1.0,1.0,1.0])|... |...       |
|0.0  |(80,[1,4,7,11,16],[1.0,1.0,1.0,1.0,1.0])|... |...       |
|0.0  |(80,[1,4,7,11,16],[1.0,1.0,1.0,1.0,1.0])|... |...       |
+-----+----------------------------------------+----+----------+
Now I'd like to decode the features back into a human-readable representation, e.g. I want to know exactly what the feature at index 4 represents.
I assume I could get this information from the indexers' labels, but my code looks like this:
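import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer, VectorAssembler}

// `in` (the input DataFrame) and `opts` are defined elsewhere in the enclosing class.
private def transform(): Unit = {
  // One indexer and one encoder per categorical column.
  val aIndexer = indexer("a")
  val bIndexer = indexer("b")
  val cIndexer = indexer("c")
  val aEncoder = encoder("a")
  val bEncoder = encoder("b")
  val cEncoder = encoder("c")

  // Concatenate the one-hot vectors into a single "features" column.
  val vectorAssembler = new VectorAssembler()
    .setInputCols(Array("aVec", "bVec", "cVec"))
    .setOutputCol("features")

  val indexers = Array[PipelineStage](aIndexer, bIndexer, cIndexer)
  val encoders = Array[PipelineStage](aEncoder, bEncoder, cEncoder)
  val pipeline = new Pipeline().setStages(indexers ++ encoders :+ vectorAssembler)

  val model = pipeline.fit(in)
  model.write.overwrite().save(opts.pipelineFileName)
  model.transform(in).show(false)
}

// Maps string values of column `name` to indices in `${name}Idx`.
private def indexer(name: String): StringIndexer = {
  new StringIndexer().setInputCol(name).setOutputCol(s"${name}Idx").setHandleInvalid("keep")
}

// One-hot encodes `${name}Idx` into `${name}Vec`, keeping the last category.
private def encoder(name: String): OneHotEncoder = {
  new OneHotEncoder().setInputCol(s"${name}Idx").setOutputCol(s"${name}Vec").setDropLast(false)
}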
With this setup, I don't see a way to access the indexers' labels to do any kind of matching.
For simplicity, let's assume the following: I have three categorical features, A, B and C, with the values A1, A2, B1, B2, C1 and C2.
What I want is to determine that the feature at index 4 in the resulting vector means B2.
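To make the goal concrete, here is a rough plain-Scala sketch of the lookup I have in mind. It does not use Spark at all. The object name `FeatureIndexLookup`, the `<unknown>` placeholder and the hard-coded label lists are purely illustrative, and it rests on two assumptions of mine: that each indexer orders its labels as listed, and that with `setHandleInvalid("keep")` plus `setDropLast(false)` every column takes one slot per label plus one extra slot for unseen values. In the real pipeline the label lists would have to come from the fitted indexers, which is exactly the part I'm missing.

```scala
object FeatureIndexLookup {
  // Hypothetical label lists, in the order each StringIndexer is assumed to
  // assign indices (the real order depends on label frequency in the data).
  val columns: Seq[(String, Seq[String])] = Seq(
    "a" -> Seq("A1", "A2"),
    "b" -> Seq("B1", "B2"),
    "c" -> Seq("C1", "C2")
  )

  // Assumption: handleInvalid = "keep" adds one extra slot per column
  // for values that were not seen during fitting.
  val unknown = "<unknown>"

  // One name per slot of the assembled vector, in the same column order
  // as the VectorAssembler inputs (a, then b, then c).
  val slotNames: IndexedSeq[String] = columns.flatMap { case (col, labels) =>
    (labels :+ unknown).map(label => s"$col=$label")
  }.toIndexedSeq

  // Returns the readable name for a vector index, or None if out of range.
  def describe(index: Int): Option[String] = slotNames.lift(index)
}
```

With these assumed labels, `FeatureIndexLookup.describe(4)` gives `Some("b=B2")`, which is the kind of answer I'd like to get from the real pipeline.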
Is there any way to do that?