I am using Spark 1.5.1. In PySpark, after I fit the model using:

model = LogisticRegressionWithLBFGS.train(parsedData)

I can print the prediction using:

model.predict(p.features)

Is there a function to print the probability score along with the prediction?


2 Answers


You have to clear the threshold first, and this works only for binary classification:

 from pyspark.mllib.classification import LogisticRegressionWithLBFGS, LogisticRegressionModel
 from pyspark.mllib.regression import LabeledPoint

 parsed_data = [LabeledPoint(0.0, [4.6, 3.6, 1.0, 0.2]),
                LabeledPoint(0.0, [5.7, 4.4, 1.5, 0.4]),
                LabeledPoint(1.0, [6.7, 3.1, 4.4, 1.4]),
                LabeledPoint(0.0, [4.8, 3.4, 1.6, 0.2]),
                LabeledPoint(1.0, [4.4, 3.2, 1.3, 0.2])]

 # sc is an existing SparkContext
 model = LogisticRegressionWithLBFGS.train(sc.parallelize(parsed_data))
 model.threshold
 # 0.5
 model.predict(parsed_data[2].features)
 # 1

 model.clearThreshold()
 model.predict(parsed_data[2].features)  # now returns the probability of class 1
 # 0.9873840020002339
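
For reference, the score returned after clearThreshold() is just the logistic sigmoid of the linear margin, so it can be reproduced by hand from the model coefficients; a minimal sketch, assuming numpy is available:

 import numpy as np

 # probability of class 1 = sigmoid(w . x + intercept)
 x = parsed_data[2].features.toArray()
 margin = np.dot(model.weights.toArray(), x) + model.intercept
 1 / (1 + np.exp(-margin))
 # should match the cleared-threshold prediction above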
desertnaut
  • 57,590
  • 26
  • 140
  • 166
  • From the documentation I couldn't find a way to do the same for multiclass classification. Are you aware if it is possible? The only way I can think of is a manual one-vs-all – Mpizos Dimitris Mar 22 '16 at 09:59
  • @MpizosDimitris, this requires changing the actual function. I have just implemented this in Scala and can provide an answer for a new question – Brian Mar 24 '16 at 21:10
  • @BrianVanover http://stackoverflow.com/questions/36151568/probability-of-predictions-using-logisticregressionwithlbfgs-for-multiclass-clas – Mpizos Dimitris Mar 29 '16 at 06:57
  • @desertnaut, it looks like there is no change regarding support for multiclass classification in Spark 2.2.0 MLlib. Is the Spark community recommending the ML package instead? Wondering why Spark is lacking on these classifiers when the support is available in scipy and even in Octave. – sunny Jul 16 '17 at 20:52
  • @sunny Indeed MLlib is headed for deprecation - ML is the recommended package now; see the sketch below – desertnaut Jul 16 '17 at 20:58
  • @desertnaut, I was getting a little fed up moving between ML and MLlib, so instead of trying with the LBFGS classifier I tried a random forest classifier, and it worked. The only challenge is that it is computationally intense, as the error goes down with an increase in the number of trees and depth. Since it had no accuracy method like the above, I had to make one, which I posted in https://stackoverflow.com/questions/28818692/pyspark-mllib-class-probabilities-of-random-forest-predictions/45135869#45135869 – sunny Jul 17 '17 at 04:17
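
Following up on the comments above: the DataFrame-based ML package exposes per-class probabilities directly, including for the multinomial case. A minimal sketch, assuming Spark 2.x, an existing SparkSession named spark, and the API's default column names:

 from pyspark.ml.classification import LogisticRegression
 from pyspark.ml.linalg import Vectors

 df = spark.createDataFrame(
     [(0.0, Vectors.dense([4.6, 3.6, 1.0, 0.2])),
      (0.0, Vectors.dense([5.7, 4.4, 1.5, 0.4])),
      (1.0, Vectors.dense([6.7, 3.1, 4.4, 1.4])),
      (0.0, Vectors.dense([4.8, 3.4, 1.6, 0.2])),
      (1.0, Vectors.dense([4.4, 3.2, 1.3, 0.2]))],
     ["label", "features"])

 lr_model = LogisticRegression().fit(df)
 # the 'probability' column holds the full per-class probability vector
 lr_model.transform(df).select("prediction", "probability").show(truncate=False)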

I presume the question is about computing the probability score for predictions over the entire training set. If so, not sure whether the post is still active, but this is how I did it:

from pyspark.mllib.classification import LogisticRegressionWithLBFGS
import pyspark.mllib.regression as reg

# Get the original training data before it was converted to rows of LabeledPoint;
# let us assume it is otd (a Spark DataFrame) with the label in column 0.
# Extract the feature set as an RDD:
fs = otd.rdd.map(lambda x: x[1:])

# A sample way of creating the LabeledPoint rows (labels 1-10 shifted to 0-9):
parsedData = otd.rdd.map(lambda x: reg.LabeledPoint(int(x[0] - 1), x[1:]))

# Convert otd to a pandas DataFrame to get the number of rows:
ptd = otd.toPandas()
m = ptd.shape[0]

# Train and get the model
model = LogisticRegressionWithLBFGS.train(trainingData, numClasses=10)

# Store the model.predict results (predict accepts a whole RDD of feature rows)
predict = model.predict(fs)
pr = predict.collect()

correct = ((ptd.label - 1) == pr).sum()
print((correct / float(m)) * 100)

Note the above is for multi-class classification.
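
As an aside, the accuracy could also be computed without toPandas(), entirely on the RDD, following the usual MLlib pattern; a sketch, assuming the parsedData RDD and model defined above:

labelsAndPreds = parsedData.map(lambda p: (p.label, model.predict(p.features)))
accuracy = labelsAndPreds.filter(lambda lp: lp[0] == lp[1]).count() / float(parsedData.count())
print(accuracy * 100)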

  • @desertnaut, please take a look if this makes sense. – sunny Jul 17 '17 at 07:13
  • 1) `trainingdata` is nowhere defined 2) `fs` is nowhere used 3) it is not clear what the outcome of your code is, and if it indeed provides probabilities; that's why it is a good practice to provide dummy data and demonstrate the results, as I have done 4) `toPandas` is not a good idea, since it will only work for 'small' datasets (where you don't even need Spark) 5) the issue has been mostly resolved in ML: https://stackoverflow.com/questions/43631031/pyspark-how-to-get-classification-probabilities-from-multilayerperceptronclassi/43643426#43643426 – desertnaut Jul 17 '17 at 09:20
  • @desertnaut, I ran this code against my dataset, which we were discussing on a different post. fs is passed as an argument to predict. My training data is a 5000x400 matrix containing labels 1-10 for a multi-class classifier. It is handwritten digits containing the numbers 1-10. I understand toPandas() is not efficient, but the goal was to compute the probability. – sunny Jul 17 '17 at 15:53
  • 1) probably, by `trainingData` you meant `parsedData` 2) if you can use `toPandas()` the way you use it, there is **absolutely no reason** to use Spark at all - you would do your job better with `pandas` & `scikit-learn` – desertnaut Jul 17 '17 at 16:06
  • Compared with `scikit-learn` and similar packages, the functionality in Spark ML/MLlib is really *primitive*; the only reason to use it is if your data do not fit into a single machine's main memory ('big data'), and hence you need to work on a distributed computing environment (cluster). – desertnaut Jul 17 '17 at 16:57
  • @desertnaut, I am evaluating which one to go for - should I do edge computing using scikit and then send aggregated data to the cloud, or do I collect the data and do it centrally? Moreover, the two parallel tracks of ML and MLlib also add to the confusion. To your point on toPandas(), note that I wouldn't have to use toPandas() and convert to a DataFrame if I had stored the data originally in a format that allows us to compute the probability as above. It is a convenience, and there was probably a reason why the Spark folks allow converting to a pandas DataFrame, just like collect() converts an RDD to a list. – sunny Jul 18 '17 at 17:03
  • `toPandas` & `collect` exist solely to allow for local processing of the **results** of (possibly successive) aggregations of large data that don't fit in memory. Let me repeat - if you do your processing on a laptop with Spark, you are simply imposing unnecessary pain on yourself without any benefit at all... – desertnaut Jul 18 '17 at 19:54