PySpark Foreach

Question

I am using spark ml and would like to get the probability value calculated from a model in the probability column of a DataFrame.

This works :

proba_classe_0 = df.rdd.map(lambda row: row.probability[0])

But it gives me back an rdd. I know ml was rdd based, and mllib is now df-based. I tried to use foreach :

df2.foreach(lambda r: r.probability[0])

But It gives back nothing.

I have Two questions :

How to make the foreach work ? I can't figure out
Ultimately I am trying to use BinaryClassificationMetrics as we can see here : https://spark.apache.org/docs/2.3.0/api/python/pyspark.mllib.html?highlight=binaryclassificationmetrics#pyspark.mllib.evaluation.BinaryClassificationMetrics . As I can do the two needed values, how can I zip them to get them back together as a DF ?

Code:

proba_classe_1 = df.rdd.map(lambda row: row.probability[1])
truth          = df.rdd.map(lambda row: row.label         )
spark.createDataFrame(proba_classe_0)

TypeError: Can not infer schema for type: <class 'numpy.float64'>

Is this what you're looking for: [How to access element of a VectorUDT column in a Spark DataFrame?](https://stackoverflow.com/questions/39555864/how-to-access-element-of-a-vectorudt-column-in-a-spark-dataframe) — pault, Apr 18 '19 at 20:50
@pault : nice catch ! could solve my problem. but still doesn't let me know what about the foreach :-/ — Romain Jouin, Apr 18 '19 at 20:52
*[`foreach`](https://spark.apache.org/docs/latest/api/python/pyspark.sql#pyspark.sql.DataFrame.foreach) is a shorthand for `df.rdd.foreach()`* — pault, Apr 18 '19 at 20:55
Now I have a strange : StructType can not accept object 0.6196931452705625 in type : https://stackoverflow.com/questions/55753710/pyspark-cant-convert-float-to-float — Romain Jouin, Apr 18 '19 at 21:02

PySpark Foreach

0 Answers0