I am using spark ml and would like to get the probability value calculated from a model in the probability column of a DataFrame.
This works :
proba_classe_0 = df.rdd.map(lambda row: row.probability[0])
But it gives me back an rdd. I know ml was rdd based, and mllib is now df-based. I tried to use foreach :
df2.foreach(lambda r: r.probability[0])
But It gives back nothing.
I have Two questions :
- How to make the foreach work ? I can't figure out
- Ultimately I am trying to use BinaryClassificationMetrics as we can see here : https://spark.apache.org/docs/2.3.0/api/python/pyspark.mllib.html?highlight=binaryclassificationmetrics#pyspark.mllib.evaluation.BinaryClassificationMetrics . As I can do the two needed values, how can I zip them to get them back together as a DF ?
Code:
proba_classe_1 = df.rdd.map(lambda row: row.probability[1])
truth = df.rdd.map(lambda row: row.label )
spark.createDataFrame(proba_classe_0)
TypeError: Can not infer schema for type: <class 'numpy.float64'>