
I am using Spark ML and would like to extract the probability values that a model writes to the `probability` column of a DataFrame.

This works:

proba_classe_0 = df.rdd.map(lambda row: row.probability[0])

But it gives me back an RDD. I know `mllib` is RDD-based and `ml` is DataFrame-based, so I tried to use `foreach` on the DataFrame instead:

df2.foreach(lambda r: r.probability[0])

But it returns nothing.

I have two questions:

  1. How can I make the `foreach` work? I can't figure it out.
  2. Ultimately I am trying to use BinaryClassificationMetrics, as documented here: https://spark.apache.org/docs/2.3.0/api/python/pyspark.mllib.html?highlight=binaryclassificationmetrics#pyspark.mllib.evaluation.BinaryClassificationMetrics . I can compute the two values it needs (the class probability and the label) separately, but how can I zip them back together as a DataFrame?

Code:

proba_classe_1 = df.rdd.map(lambda row: row.probability[1])
truth          = df.rdd.map(lambda row: row.label         )
spark.createDataFrame(proba_classe_0)
TypeError: Can not infer schema for type: <class 'numpy.float64'>
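
One possible way around the `TypeError` above, as a minimal sketch and not a definitive fix: the error most likely arises because each RDD element is a bare `numpy.float64` and not a tuple or `Row`, so Spark has no row structure to infer a schema from. Mapping once to a `(score, label)` tuple of plain Python floats keeps the two values paired, so no zip is needed. The names `to_score_and_label`, `binary_metrics`, `score_label_df` and the `FakeRow` stand-in are hypothetical; the column names `probability` and `label` come from the question, and `spark` is assumed to be an active SparkSession.

```python
from collections import namedtuple


def to_score_and_label(row):
    """Return (P(class 1), label) as plain Python floats.

    float() converts numpy.float64 (what indexing a DenseVector yields)
    into a built-in float that Spark can map to DoubleType.
    """
    return (float(row.probability[1]), float(row.label))


def binary_metrics(df):
    """Build BinaryClassificationMetrics from a DataFrame that has
    `probability` and `label` columns (requires pyspark)."""
    from pyspark.mllib.evaluation import BinaryClassificationMetrics

    # BinaryClassificationMetrics expects an RDD of (score, label) pairs.
    score_and_labels = df.rdd.map(to_score_and_label)
    return BinaryClassificationMetrics(score_and_labels)


def score_label_df(spark, df):
    """Return the same pairs as a two-column DataFrame (requires pyspark)."""
    pairs = df.rdd.map(to_score_and_label)
    # Tuples give Spark a row structure; the list supplies column names.
    return spark.createDataFrame(pairs, ["score", "label"])


# Stand-in for a Spark Row, used only to exercise the helper without Spark.
FakeRow = namedtuple("FakeRow", ["probability", "label"])
example = to_score_and_label(FakeRow(probability=[0.38, 0.62], label=1))
print(example)  # (0.62, 1.0)
```

With a real DataFrame, `binary_metrics(df).areaUnderROC` would then give the metric directly. As for `foreach`: it is an action run for its side effects and returns `None`, which is why the call in the question gives nothing back; `map` is the transformation that returns values.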
  • Is this what you're looking for: [How to access element of a VectorUDT column in a Spark DataFrame?](https://stackoverflow.com/questions/39555864/how-to-access-element-of-a-vectorudt-column-in-a-spark-dataframe) – pault Apr 18 '19 at 20:50
  • @pault: nice catch! That could solve my problem, but it still doesn't tell me what is going on with the `foreach` :-/ – Romain Jouin Apr 18 '19 at 20:52
  • *[`foreach`](https://spark.apache.org/docs/latest/api/python/pyspark.sql#pyspark.sql.DataFrame.foreach) is a shorthand for `df.rdd.foreach()`* – pault Apr 18 '19 at 20:55
  • Now I get a strange error, "StructType can not accept object 0.6196931452705625 in type": https://stackoverflow.com/questions/55753710/pyspark-cant-convert-float-to-float – Romain Jouin Apr 18 '19 at 21:02
