
I'm using Spark 2.0.1 in Python; my dataset is in a DataFrame, so I'm using the ML (not MLlib) library for machine learning. I have a multilayer perceptron classifier, and I have only two labels.

My question is: is it possible to get not only the labels but also (or only) the probability for each label? Not just 0 or 1 for every input, but something like 0.95 for 0 and 0.05 for 1. If this is not possible with MLP but is possible with another classifier, I can change the classifier. I have only used MLP because I know it should be capable of returning a probability, but I can't find how to get it in PySpark.

I have found a similar topic about this, How to get classification probabilities from MultilayerPerceptronClassifier?, but it uses Java and the suggested solution doesn't work in Python.

Thanks.

edited by desertnaut
asked by Ondrej

1 Answer


Indeed, as of version 2.0, MLP in Spark ML does not provide classification probabilities; nevertheless, several other classifiers do, e.g. Logistic Regression, Naive Bayes, Decision Tree, and Random Forest. Here is a short example with the first and the last of these:

from pyspark.ml.classification import LogisticRegression, RandomForestClassifier
from pyspark.ml.linalg import Vectors
from pyspark.sql import Row

# sqlContext and sc below are the contexts pre-created in the PySpark shell;
# in a standalone script, build them from a SparkSession instead
df = sqlContext.createDataFrame([
     (0.0, Vectors.dense(0.0, 1.0)),
     (1.0, Vectors.dense(1.0, 0.0))],
     ["label", "features"])
df.show()
# +-----+---------+
# |label| features|
# +-----+---------+
# |  0.0|[0.0,1.0]|
# |  1.0|[1.0,0.0]|
# +-----+---------+

lr = LogisticRegression(maxIter=5, regParam=0.01, labelCol="label")
lr_model = lr.fit(df)

rf = RandomForestClassifier(numTrees=3, maxDepth=2, labelCol="label", seed=42)
rf_model = rf.fit(df)

# test data:
test = sc.parallelize([Row(features=Vectors.dense(0.2, 0.5)),
                       Row(features=Vectors.dense(0.5, 0.2))]).toDF()

lr_result = lr_model.transform(test)
lr_result.show()
# +---------+--------------------+--------------------+----------+
# | features|       rawPrediction|         probability|prediction|
# +---------+--------------------+--------------------+----------+
# |[0.2,0.5]|[0.98941878916476...|[0.72897310704261...|       0.0|
# |[0.5,0.2]|[-0.9894187891647...|[0.27102689295738...|       1.0|  
# +---------+--------------------+--------------------+----------+

rf_result = rf_model.transform(test)
rf_result.show()
# +---------+-------------+--------------------+----------+ 
# | features|rawPrediction|         probability|prediction| 
# +---------+-------------+--------------------+----------+ 
# |[0.2,0.5]|    [1.0,2.0]|[0.33333333333333...|       1.0| 
# |[0.5,0.2]|    [1.0,2.0]|[0.33333333333333...|       1.0| 
# +---------+-------------+--------------------+----------+

For MLlib, see my answer here; for several undocumented & counter-intuitive features of PySpark classification, see my relevant blog post.
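
As the comments below discuss, the `prediction` column is simply the index of the largest entry in the `probability` vector. A plain-Python sketch of that relationship, using the (rounded) logistic-regression values from the output above:

```python
# prediction = argmax over the probability vector; indices are the class labels
def predict(probs):
    return float(max(range(len(probs)), key=probs.__getitem__))

assert predict([0.72897, 0.27103]) == 0.0  # first test row  -> class 0
assert predict([0.27103, 0.72897]) == 1.0  # second test row -> class 1
```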

answered by desertnaut
  • Can you explain what the probability column is actually showing? In the lr example in the first row, is 0.7289 the predicted probability that the outcome is equal to 1? If so, then why is the prediction 0 (and the row with predicted probability = 0.2710 has prediction = 1)? – gannawag Feb 14 '18 at 20:36
  • 3
    @gannawag notice the dots (`...`); only the first element of the `probabilities` 2D array is shown here, i.e. in the first row the `probability[0]` has the greatest value (hence the prediction of `0.0`), while in the second row the (not shown) `probability[1]` has the greatest value, hence the prediction of `1.0`. Similarly, in RF, in both rows the `probability[1]` (again, not shown above) has the greatest value. hence both predictions are for class 1. The example is easily reproducible, just try it with `lr_result.show(truncate=False)` to see the full array values. – desertnaut Feb 14 '18 at 23:57
  • @gannawag 0.7289 is the probability that the outcome is `0.0` (Python indexes are zero-based), and the same holds for 0.2710 – desertnaut Feb 15 '18 at 00:07