1

I'm starting with PySpark, building binary classification models (logistic regression), and I need to find the optimal threshold (cuttoff) point for my models.

I want to use the ROC curve to find this point, but I don't know how to extract the threshold value for each point in this curve. Is there a way to find this values?

Things I've found:

  • This post shows how to extract the ROC curve, but only the values for the TPR and FPR. It's useful for plotting and for selecting the optimal point, but I can't find the threshold value.
  • I know I can find the threshold values for each point in the ROC curve using H2O (I've done it before), but I'm working on Pyspark.
  • Here is a post describing how to do it with R... but, again, I need to do it with Pyspark

Other facts

  • I'm using Apache Spark 2.4.0.
  • I'm working with Data Frames (I really don't know - yet - how to work with RDDs, but I'm not afraid to learn ;) )
Barranka
  • 20,547
  • 13
  • 65
  • 83

2 Answers2

3

If you specifically need to generate ROC curves for different thresholds, one approach could be to generate a list of threshold values you're interested in and fit/transform on your dataset for each threshold. Or you could manually calculate the ROC curve for each threshold point using the probability field in the response from model.transform(test).

Alternatively, you can use BinaryClassificationMetrics to extract a curve plotting various metrics (F1 score, precision, recall) by threshold.

Unfortunately it appears the PySpark version doesn't implement most of the methods the Scala version does, so you'd need to wrap the class to do it in Python.

For example:

from pyspark.mllib.evaluation import BinaryClassificationMetrics

# Scala version implements .roc() and .pr()
# Python: https://spark.apache.org/docs/latest/api/python/_modules/pyspark/mllib/common.html
# Scala: https://spark.apache.org/docs/latest/api/java/org/apache/spark/mllib/evaluation/BinaryClassificationMetrics.html
class CurveMetrics(BinaryClassificationMetrics):
    def __init__(self, *args):
        super(CurveMetrics, self).__init__(*args)

    def _to_list(self, rdd):
        points = []
        # Note this collect could be inefficient for large datasets 
        # considering there may be one probability per datapoint (at most)
        # The Scala version takes a numBins parameter, 
        # but it doesn't seem possible to pass this from Python to Java
        for row in rdd.collect():
            # Results are returned as type scala.Tuple2, 
            # which doesn't appear to have a py4j mapping
            points += [(float(row._1()), float(row._2()))]
        return points

    def get_curve(self, method):
        rdd = getattr(self._java_model, method)().toJavaRDD()
        return self._to_list(rdd)

Usage:

import matplotlib.pyplot as plt

preds = predictions.select('label','probability').rdd.map(lambda row: (float(row['probability'][1]), float(row['label'])))

# Returns as a list (false positive rate, true positive rate)
points = CurveMetrics(preds).get_curve('roc')

plt.figure()
x_val = [x[0] for x in points]
y_val = [x[1] for x in points]
plt.title(title)
plt.xlabel(xlabel)
plt.ylabel(ylabel)
plt.plot(x_val, y_val)

Results in: ROC curve generated with BinaryClassificationMetrics

Here's an example of an F1 score curve by threshold value if you aren't married to ROC: F1 score by threshold curve using BinaryClassificationMetrics

Alex Ross
  • 3,729
  • 3
  • 26
  • 26
1

One way is to use sklearn.metrics.roc_curve.

First use your fitted model to make predictions:

from pyspark.ml.classification import LogisticRegression

lr = LogisticRegression(labelCol="label", featuresCol="features")
model = lr.fit(trainingData)
predictions = model.transform(testData)

Then collect your scores and labels1:

preds = predictions.select('label','probability')\
    .rdd.map(lambda row: (float(row['probability'][1]), float(row['label'])))\
    .collect()

Now transform preds to work with roc_curve

from sklearn.metrics import roc_curve

y_score, y_true = zip(*preds)
fpr, tpr, thresholds = roc_curve(y_true, y_score, pos_label = 1)

Notes:

  1. I am not 100% certain that the probabilities vector will always be ordered such that the positive label will be at index 1. However in a binary classification problem, you'll know right away if your AUC is less than 0.5. In that case, just take 1-p for the probabilities (since the class probabilities sum to 1).
pault
  • 41,343
  • 15
  • 107
  • 149