Spark MLlib provides several algorithms for classification, such as Random Forests and Logistic Regression. The examples of training a classifier and predicting a class are straightforward, but it is not clear which classifier API to use to get the probability that a given instance belongs to the predicted class. For example, for the Random Forests classifier:
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.log4j.Logger
import org.apache.log4j.Level
import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.tree.model.RandomForestModel
import org.apache.spark.mllib.util.MLUtils

object RFClassifier {
  def main(args: Array[String]): Unit = {
    Logger.getLogger("org.apache.spark").setLevel(Level.WARN)
    Logger.getLogger("org.eclipse.jetty.server").setLevel(Level.OFF)
    // set up environment
    val conf = new SparkConf()
      .setMaster("local[5]")
      .setAppName("RFClassifier")
      .set("spark.executor.memory", "2g")
    val sc = new SparkContext(conf)
    // Load and parse the data file.
    val data = MLUtils.loadLibSVMFile(sc, "in/sample_libsvm_data.txt")
    // Split the data into training and test sets (30% held out for testing)
    val splits = data.randomSplit(Array(0.7, 0.3))
    val (trainingData, testData) = (splits(0), splits(1))
    // Train a RandomForest model.
    // Empty categoricalFeaturesInfo indicates all features are continuous.
    val numClasses = 2
    val categoricalFeaturesInfo = Map[Int, Int]()
    val numTrees = 3 // Use more in practice.
    val featureSubsetStrategy = "auto" // Let the algorithm choose.
    val impurity = "gini"
    val maxDepth = 4
    val maxBins = 32
    val model = RandomForest.trainClassifier(trainingData, numClasses, categoricalFeaturesInfo,
      numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins)
    // Evaluate model on test instances and compute test error
    val labelAndPreds = testData.map { point =>
      val prediction = model.predict(point.features)
      (point.label, prediction)
    }
    val testErr = labelAndPreds.filter(r => r._1 != r._2).count().toDouble / testData.count()
    println("Test Error = " + testErr)
    println("Learned classification forest model:\n" + model.toDebugString)
    // Save and load model
    model.save(sc, "RFClassifierModel")
    val sameModel = RandomForestModel.load(sc, "RFClassifierModel")
  }
}
How can one find the probabilities of the predicted classes? The same question applies to other classifiers as well. Any ideas? Thanks!
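To make the question concrete, here is a minimal sketch of one possible approach for the random forest case, not a built-in MLlib probability API: since RandomForestModel exposes its individual trees through model.trees, one could collect each tree's vote for an instance and use the fraction of trees voting for each class as a coarse score. The object name VoteFractions, the helper voteFractions, and the sample votes are made up for illustration; the helper is plain Scala so it can be tried without a Spark cluster.

```scala
object VoteFractions {
  // Fraction of trees voting for each class.
  // Votes are class indices encoded as Doubles (0.0, 1.0, ...), as MLlib's predict returns them.
  def voteFractions(votes: Seq[Double], numClasses: Int): Array[Double] = {
    require(votes.nonEmpty, "need at least one vote")
    val counts = Array.fill(numClasses)(0)
    votes.foreach(v => counts(v.toInt) += 1)
    counts.map(_.toDouble / votes.size)
  }

  def main(args: Array[String]): Unit = {
    // With the trained model, the votes for one test point would come from
    //   model.trees.map(_.predict(point.features))
    // Here they are hypothetical values standing in for numTrees = 3 trees.
    val votes = Seq(1.0, 1.0, 0.0)
    val fractions = voteFractions(votes, numClasses = 2)
    println(fractions.mkString(", "))
  }
}
```

With only three trees the fractions can take just four values (0, 1/3, 2/3, 1), so this is a very rough score; it also counts hard votes rather than averaging class distributions at the leaves, so it should not be read as a calibrated probability.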
Update:
As a rough workaround: every type of classifier first has to be trained on a training set. Once training is done, one can always compute the percentage of correct predictions on that training set. Can this percentage be used as a rough estimate of the probability that any instance belongs to its predicted class? For example, if a given classifier gets 80% of its predictions on the training set right, can we assume that, on average, the probability that an instance really has its predicted class is 0.8 for this classifier?
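The workaround can be sketched in a few lines to show what such a number would and would not say. The sketch below computes the overall fraction of correct predictions from (label, prediction) pairs, and also the fraction of correct predictions within each predicted class, which is a slightly finer version of the same idea. The object name, both helper names, and the ten sample pairs are invented for illustration; with the code above, the pairs would come from something like labelAndPreds.collect() on a small data set. Note that accuracy measured on the training set tends to be optimistic, so held-out data such as testData would be the safer source.

```scala
object AccuracyVsPrecision {
  // Overall fraction of correct predictions.
  def accuracy(pairs: Seq[(Double, Double)]): Double =
    pairs.count { case (label, pred) => label == pred }.toDouble / pairs.size

  // For each predicted class: the fraction of those predictions that were correct.
  def precisionByPredictedClass(pairs: Seq[(Double, Double)]): Map[Double, Double] =
    pairs.groupBy(_._2).map { case (pred, group) =>
      pred -> group.count(_._1 == pred).toDouble / group.size
    }

  def main(args: Array[String]): Unit = {
    // Made-up (label, prediction) pairs: 8 of 10 predictions are correct.
    val pairs = Seq(
      (0.0, 0.0), (0.0, 0.0), (0.0, 0.0), (0.0, 0.0), (0.0, 0.0), (0.0, 0.0),
      (1.0, 0.0), // predicted 0, actually 1
      (1.0, 1.0), (1.0, 1.0),
      (0.0, 1.0)  // predicted 1, actually 0
    )
    println("accuracy = " + accuracy(pairs))
    precisionByPredictedClass(pairs).foreach { case (cls, p) =>
      println("predicted class " + cls + ": fraction correct = " + p)
    }
  }
}
```

In this made-up data the overall figure is 0.8, yet predictions of class 0 are right 6 times out of 7 and predictions of class 1 only 2 times out of 3. Either number is the same for every instance predicted as that class, so it cannot distinguish a confident prediction from a borderline one.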