I'm using Spark 2.2 ML RandomForestClassifier to do some predictions.

I have a result like this:

+-----+----------------------------------------+----+----------+
|label|features                                |prob|prediction|
+-----+----------------------------------------+----+----------+
|0.0  |(80,[0,4,9,11,16],[1.0,1.0,1.0,1.0,1.0])|... |...       |
|0.0  |(80,[0,4,9,11,16],[1.0,1.0,1.0,1.0,1.0])|... |...       |
|0.0  |(80,[1,5,7,11,16],[1.0,1.0,1.0,1.0,1.0])|... |...       |
|0.0  |(80,[1,6,7,12,16],[1.0,1.0,1.0,1.0,1.0])|... |...       |
|0.0  |(80,[1,4,7,11,16],[1.0,1.0,1.0,1.0,1.0])|... |...       |
|0.0  |(80,[1,4,7,11,16],[1.0,1.0,1.0,1.0,1.0])|... |...       |
+-----+----------------------------------------+----+----------+

Now I'd like to decode the features back into a human-readable representation, e.g. I want to know exactly what the feature at index 4 represents.

I assume I could get this information from the indexers' labels, but my code looks like this:

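import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer, VectorAssembler}

private def transform(): Unit = {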
  val aIndexer = indexer("a")
  val bIndexer = indexer("b")
  val cIndexer = indexer("c")

  val aEncoder = encoder("a")
  val bEncoder = encoder("b")
  val cEncoder = encoder("c")

  val vectorAssembler = new VectorAssembler()
    .setInputCols(Array("aVec", "bVec", "cVec"))
    .setOutputCol("features")

  val indexers = Array[PipelineStage](aIndexer, bIndexer, cIndexer)
  val encoders = Array[PipelineStage](aEncoder, bEncoder, cEncoder)

  val pipeline = new Pipeline().setStages(indexers ++ encoders :+ vectorAssembler)
  val model = pipeline.fit(in)
  model.write.overwrite().save(opts.pipelineFileName)

  model.transform(in).show(false)
}

private def indexer(name: String): StringIndexer = {
  new StringIndexer().setInputCol(name).setOutputCol(s"${name}Idx").setHandleInvalid("keep")
}

private def encoder(name: String): OneHotEncoder = {
  new OneHotEncoder().setInputCol(s"${name}Idx").setOutputCol(s"${name}Vec").setDropLast(false)
}

With this setup, I don't see a way to access the indexers' labels to do any kind of matching.

For simplicity let's assume the following: I have 3 categorical features - A, B and C. They have values A1, A2, B1, B2, C1, C2.

What I want to do is determine that the feature at index 4 in the resulting vector means B2.
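To make the desired mapping concrete, here is a minimal plain-Scala sketch of the index arithmetic, with no Spark dependency. It rests on several assumptions that are not confirmed above: that the label arrays are available in indexer order (for instance from the `labels` of each fitted `StringIndexerModel`), that `setHandleInvalid("keep")` together with `setDropLast(false)` gives each encoded block one extra slot for unseen values, and that the assembler concatenates the blocks in input-column order. `FeatureIndexDecoder`, `slotNames` and the `__unknown` placeholder are hypothetical names used only for illustration.

```scala
object FeatureIndexDecoder {

  // Hypothetical helper: returns one "column=value" name per slot of the assembled vector.
  // `columns` lists (column name, labels in indexer order) in the order the blocks are assembled.
  // `keepInvalid` models the ASSUMED extra slot per block for values unseen during fitting.
  def slotNames(columns: Seq[(String, Seq[String])], keepInvalid: Boolean): Vector[String] =
    columns.flatMap { case (name, labels) =>
      val slots = if (keepInvalid) labels :+ "__unknown" else labels
      slots.map(label => s"$name=$label")
    }.toVector

  def main(args: Array[String]): Unit = {
    // Label order is assumed here; a real indexer decides it (by default, by frequency).
    val columns = Seq(
      "a" -> Seq("A1", "A2"),
      "b" -> Seq("B1", "B2"),
      "c" -> Seq("C1", "C2")
    )
    val names = slotNames(columns, keepInvalid = true)
    names.zipWithIndex.foreach { case (name, i) => println(s"$i -> $name") }
  }
}
```

Under these assumptions the vector has 9 slots and index 4 resolves to `b=B2`. With `keepInvalid = false` the same index would resolve to `c=C1`, so the result depends on whether the extra slot really exists in the encoded vectors.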

Is there any way to do that?

serejja