Predicting probabilities of classes in case of Gradient Boosting Trees in Spark using the tree output

Question

It is known that GBT s in Spark gives you predicted labels as of now.

I was thinking of trying to calculate predicted probabilities for a class (say all the instances falling under a certain leaf)

The codes to build GBT's

import org.apache.spark.SparkContext
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.tree.GradientBoostedTrees
import org.apache.spark.mllib.tree.configuration.BoostingStrategy
import org.apache.spark.mllib.tree.model.GradientBoostedTreesModel
import org.apache.spark.mllib.util.MLUtils

//Importing the data
val data = sc.textFile("data/mllib/credit_approval_2_attr.csv") //using the credit approval data set from UCI machine learning repository

//Parsing the data
val parsedData = data.map { line =>
    val parts = line.split(',').map(_.toDouble)
    LabeledPoint(parts(0), Vectors.dense(parts.tail))
}

//Splitting the data
val splits = parsedData.randomSplit(Array(0.7, 0.3), seed = 11L)
val training = splits(0).cache() 
val test = splits(1)

// Train a GradientBoostedTrees model.
// The defaultParams for Classification use LogLoss by default.
val boostingStrategy = BoostingStrategy.defaultParams("Classification")
boostingStrategy.numIterations = 2 // We can use more iterations in practice.
boostingStrategy.treeStrategy.numClasses = 2
boostingStrategy.treeStrategy.maxDepth = 2
boostingStrategy.treeStrategy.maxBins = 32
boostingStrategy.treeStrategy.subsamplingRate = 0.5
boostingStrategy.treeStrategy.maxMemoryInMB =1024
boostingStrategy.learningRate = 0.1

// Empty categoricalFeaturesInfo indicates all features are continuous.
boostingStrategy.treeStrategy.categoricalFeaturesInfo = Map[Int, Int]()

val model = GradientBoostedTrees.train(training, boostingStrategy)  

model.toDebugString

This gives me 2 trees of depth 2 as below for simplicity:

 Tree 0:
    If (feature 3 <= 2.0)
     If (feature 2 <= 1.25)
      Predict: -0.5752212389380531
     Else (feature 2 > 1.25)
      Predict: 0.07462686567164178
    Else (feature 3 > 2.0)
     If (feature 0 <= 30.17)
      Predict: 0.7272727272727273
     Else (feature 0 > 30.17)
      Predict: 1.0
  Tree 1:
    If (feature 5 <= 67.0)
     If (feature 4 <= 100.0)
      Predict: 0.5739387416147804
     Else (feature 4 > 100.0)
      Predict: -0.550117566730937
    Else (feature 5 > 67.0)
     If (feature 2 <= 0.0)
      Predict: 3.0383669122382835
     Else (feature 2 > 0.0)
      Predict: 0.4332824083446489

My question is: Can I use the above trees to calculate predicted probabilities like:

With respect to every instance in the feature set used for prediction

exp(leaf score from tree 0 + leaf score from tree 1)/(1+exp(leaf score from tree 0 + leaf score from tree 1))

This gives me a kind of probability. But not sure if it is the right way to do it. Also if there is any document explaining how leaf score (prediction) are calculated. I would be really grateful if anybody can share.

Any suggestion would be superb.

Boxuan · Answer 1 · 2016-08-01T17:24:09.370

Here is my approach using Spark internal dependencies. You will need to import the linear algebra library for the matrix operation later, i.e., multiplying the tree predictions with the learning rate.

import org.apache.spark.mllib.linalg.{Vectors, Matrices}
import org.apache.spark.mllib.linalg.distributed.{RowMatrix}

Say you build a model with GBT:

val model = GradientBoostedTrees.train(trainingData, boostingStrategy)

To calculate the probability using the model object:

// Get the log odds predictions from each tree
val treePredictions = testData.map { point => model.trees.map(_.predict(point.features)) }

// Transform the arrays into matrices for multiplication
val treePredictionsVector = treePredictions.map(array => Vectors.dense(array))
val treePredictionsMatrix = new RowMatrix(treePredictionsVector)
val learningRate = model.treeWeights
val learningRateMatrix = Matrices.dense(learningRate.size, 1, learningRate)
val weightedTreePredictions = treePredictionsMatrix.multiply(learningRateMatrix)

// Calculate probability by ensembling the log odds
val classProb = weightedTreePredictions.rows.flatMap(_.toArray).map(x => 1 / (1 + Math.exp(-1 * x)))
classProb.collect

// You may tweak your decision boundary for different class labels
val classLabel = classProb.map(x => if (x > 0.5) 1.0 else 0.0)
classLabel.collect

Here is a code snippet you can copy & paste directly into spark-shell:

import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.{Vectors, Matrices}
import org.apache.spark.mllib.linalg.distributed.{RowMatrix}
import org.apache.spark.mllib.tree.GradientBoostedTrees
import org.apache.spark.mllib.tree.configuration.BoostingStrategy
import org.apache.spark.mllib.tree.model.GradientBoostedTreesModel

// Load and parse the data file.
val csvData = sc.textFile("data/mllib/sample_tree_data.csv")
val data = csvData.map { line =>
  val parts = line.split(',').map(_.toDouble)
  LabeledPoint(parts(0), Vectors.dense(parts.tail))
}
// Split the data into training and test sets (30% held out for testing)
val splits = data.randomSplit(Array(0.7, 0.3))
val (trainingData, testData) = (splits(0), splits(1))

// Train a GBT model.
val boostingStrategy = BoostingStrategy.defaultParams("Classification")
boostingStrategy.numIterations = 50
boostingStrategy.treeStrategy.numClasses = 2
boostingStrategy.treeStrategy.maxDepth = 6
boostingStrategy.treeStrategy.categoricalFeaturesInfo = Map[Int, Int]()

val model = GradientBoostedTrees.train(trainingData, boostingStrategy)

// Get class label from raw predict function
val predictedLabels = model.predict(testData.map(_.features))
predictedLabels.collect

// Get class probability
val treePredictions = testData.map { point => model.trees.map(_.predict(point.features)) }
val treePredictionsVector = treePredictions.map(array => Vectors.dense(array))
val treePredictionsMatrix = new RowMatrix(treePredictionsVector)
val learningRate = model.treeWeights
val learningRateMatrix = Matrices.dense(learningRate.size, 1, learningRate)
val weightedTreePredictions = treePredictionsMatrix.multiply(learningRateMatrix)
val classProb = weightedTreePredictions.rows.flatMap(_.toArray).map(x => 1 / (1 + Math.exp(-1 * x)))
val classLabel = classProb.map(x => if (x > 0.5) 1.0 else 0.0)
classLabel.collect

score 1 · Answer 2 · answered Jun 13 '16 at 14:22

def score(features: Vector,gbdt: GradientBoostedTreesModel): Double = {
    val treePredictions = gbdt.trees.map(_.predict(features))
    blas.ddot(gbdt.numTrees, treePredictions, 1, gbdt.treeWeights, 1)
}
def sigmoid(v : Double) : Double = {
    1/(1+Math.exp(-v))
}
// model is output of GradientBoostedTrees.train(...,...)
// testData is libSVM format
val labelAndPreds = testData.map { point =>
        var prediction = score(point.features,model)
        prediction = sigmoid(prediction)
        (point.label, Vectors.dense(1.0-prediction, prediction))
}

score 0 · Answer 3 · answered May 30 '16 at 15:28

0

Actually I was able predict the probabilities using the tree and the formulation of the tree given in the question. I actually checked with the GBT predicted labels output. It matches exactly when I use threshold as 0.5.

So we do the same with a slight change.

With respect to every instance in the feature set used for prediction:

exp(leaf score from tree 0 + (learning_rate)* leaf score from tree 1)/(1+exp(leaf score from tree 0 + (learning_rate)* leaf score from tree 1))

This essentially gives me the predicted probabilities.

I tested the same on 3 trees with depth 3. It worked. And also with different data sets.

It would be great to know if anyone else have already tried this. If not, they can try this and comment.

answered May 30 '16 at 15:28

PARTHA TALUKDER

321
5
17

2

Why don't you paste the probability calculating code here. That will help the community – Run2 Jun 13 '16 at 10:26
This is the same as the other answers since exp(x)/(1+exp(x)) = 1/(1+exp(-x)), and the weight for tree 0 is 1 as opposed to learning rate – Brian Nov 10 '16 at 13:37
Can Someone help me out in understanding how to do this excercise using the GBTClassifier which uses dataframes? – Rajarshi Bhadra Nov 21 '16 at 12:47

score 0 · Answer 4 · answered Jul 03 '17 at 08:00

In fact, the above ans is wrong, sigmoid function is false in this situation for spark translate label into {-1,1}. You should use a code like this:

def score(features: Vector,gbdt: GradientBoostedTreesModel): Double = {
    val treePredictions = gbdt.trees.map(_.predict(features))
    blas.ddot(gbdt.numTrees, treePredictions, 1, gbdt.treeWeights, 1)
}
val labelAndPreds = testData.map { point =>
        var prediction = score(point.features,model)
        prediction = 1.0 / (1.0 + math.exp(-2.0 * prediction))
        (point.label, Vectors.dense(1.0-prediction, prediction))
}

The more detail can be seen in page 9 of "Greedy Function Approximation? A Gradient Boosting Machine". And a pull request in spark: https://github.com/apache/spark/pull/16441

score 0 · Answer 5 · answered Apr 12 '19 at 06:18

In fact ,@hbghhy saw is wrong ,@Run2 is right ,Spark use twice the binomial negative log likelihood as Loss ,but Friedman use binomial negative log likelihood as Loss in page 9 of "Greedy Function Approximation" .

/**
 * :: DeveloperApi ::
 * Class for log loss calculation (for classification).
 * This uses twice the binomial negative log likelihood, called "deviance" in Friedman (1999).
 *
 * The log loss is defined as:
 *   2 log(1 + exp(-2 y F(x)))
 * where y is a label in {-1, 1} and F(x) is the model prediction for features x.
 */
@Since("1.2.0")
@DeveloperApi
object LogLoss extends ClassificationLoss {

  /**
   * Method to calculate the loss gradients for the gradient boosting calculation for binary
   * classification
   * The gradient with respect to F(x) is: - 4 y / (1 + exp(2 y F(x)))
   * @param prediction Predicted label.
   * @param label True label.
   * @return Loss gradient
   */
  @Since("1.2.0")
  override def gradient(prediction: Double, label: Double): Double = {
    - 4.0 * label / (1.0 + math.exp(2.0 * label * prediction))
  }

  override private[spark] def computeError(prediction: Double, label: Double): Double = {
    val margin = 2.0 * label * prediction
    // The following is equivalent to 2.0 * log(1 + exp(-margin)) but more numerically stable.
    2.0 * MLUtils.log1pExp(-margin)
  }
}

Predicting probabilities of classes in case of Gradient Boosting Trees in Spark using the tree output

5 Answers5

Linked