Spark MLlib provides several algorithms for classification, such as Random Forests and Logistic Regression. Examples of classifier training and class prediction are straightforward. Yet it is not clear which classifier API to use to get the probability that a given instance belongs to the predicted class. For example, for a Random Forests classifier:

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.log4j.Logger
import org.apache.log4j.Level
import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.tree.model.RandomForestModel
import org.apache.spark.mllib.util.MLUtils

object RFClassifier {

  def main(args: Array[String]): Unit = {

    Logger.getLogger("org.apache.spark").setLevel(Level.WARN)
    Logger.getLogger("org.eclipse.jetty.server").setLevel(Level.OFF)
    // set up environment
    val conf = new SparkConf()
      .setMaster("local[5]")
      .setAppName("RFClassifier")
      .set("spark.executor.memory", "2g")
    val sc = new SparkContext(conf)

    // Load and parse the data file.
    val data = MLUtils.loadLibSVMFile(sc, "in/sample_libsvm_data.txt")
    // Split the data into training and test sets (30% held out for testing)
    val splits = data.randomSplit(Array(0.7, 0.3))
    val (trainingData, testData) = (splits(0), splits(1))

    // Train a RandomForest model.
    //  Empty categoricalFeaturesInfo indicates all features are continuous.
    val numClasses = 2
    val categoricalFeaturesInfo = Map[Int, Int]()
    val numTrees = 3 // Use more in practice.
    val featureSubsetStrategy = "auto" // Let the algorithm choose.
    val impurity = "gini"
    val maxDepth = 4
    val maxBins = 32

    val model = RandomForest.trainClassifier(trainingData, numClasses, categoricalFeaturesInfo,
      numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins)

    // Evaluate model on test instances and compute test error
    val labelAndPreds = testData.map { point =>
      val prediction = model.predict(point.features)
      (point.label, prediction)
    }
    val testErr = labelAndPreds.filter(r => r._1 != r._2).count.toDouble / testData.count()
    println("Test Error = " + testErr)
    println("Learned classification forest model:\n" + model.toDebugString)

    // Save and load model
    model.save(sc, "RFClassifierModel")
    val sameModel = RandomForestModel.load(sc, "RFClassifierModel")
  }
}

How can one find the probabilities of the predicted classes? The same question applies to other classifiers as well. Any ideas? Thanks!

Update:

As a rough workaround: every type of classifier first needs to be trained on a training set. Once training is done, one can always compute the percentage of correct predictions on that training set. Can this percentage be used as a rough estimate of the probability that any instance belongs to its predicted class? For example, if a given classifier gets 80% of its predictions on the training set right, can we assume that the average probability of an instance having its predicted class is 0.8 for this classifier?

zero323
zork
  • Dunno specifically about MLlib, but in general not all classifiers produce class probabilities. E.g. for logistic regression and neural networks with sigmoid output, the output is the class probability, but for SVM it is not (although there is a hackish way to compute a probability). For classifiers based on trees, the proportion of positive examples in a leaf of the tree can be taken as the class probability, and for a random forest it would then be the average of the outputs from all the trees. I would expect that's been programmed already but if not it wouldn't be too hard to write it. – Robert Dodier Jul 06 '15 at 23:43
  • Please, see my question update – zork Jul 07 '15 at 12:10
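
The averaging idea from the comment above can be sketched roughly as follows. This is a coarser variant than averaging leaf proportions: it only counts each tree's hard vote and reports the fraction of trees voting for each class. The object `VoteFractions` and the function `classFractions` are hypothetical names made up for this illustration. The votes are passed in as plain numbers so the snippet runs without Spark; with MLlib they could presumably be collected with something like `model.trees.map(_.predict(point.features))` on the trained `RandomForestModel`, but that wiring is an assumption and not part of the question's code.

```scala
// Sketch only: estimates class probabilities as the share of trees that
// vote for each class. Names here are hypothetical, not MLlib API.
object VoteFractions {

  /** Fraction of trees voting for each class label; the fractions sum to 1. */
  def classFractions(treeVotes: Seq[Double], numClasses: Int): Array[Double] = {
    require(treeVotes.nonEmpty, "need at least one tree vote")
    val counts = Array.fill(numClasses)(0.0)
    // Each vote is a class label encoded as a Double (0.0, 1.0, ...),
    // which is how MLlib classifiers report predictions.
    treeVotes.foreach { vote => counts(vote.toInt) += 1.0 }
    counts.map(_ / treeVotes.size)
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical votes from a 3-tree forest for one test instance.
    val fractions = classFractions(Seq(1.0, 1.0, 0.0), numClasses = 2)
    println(fractions.mkString(", "))
  }
}
```

With only a handful of trees (the question uses `numTrees = 3`) the fractions are very coarse, so a larger forest would be needed before reading them as probabilities.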
