
I am using the sparklyr package with data in Spark. I'm building a logistic regression model and can't figure out how to extract the probability for each class from the probability column of the predictions data frame that ml_predict() produces for my model.

Here is some short sample code that demonstrates what I'm doing:

library(sparklyr)
library(dplyr)

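# Connect to a local Spark instance
sc <- spark_connect(master = "local[1]", version = "2.3.2")

# Copy iris to Spark; copy_to() replaces the dots in column names with underscores
iris_sc <- copy_to(sc, iris)

# Pipeline: R formula feature transformer followed by logistic regression
modelPipeline <- ml_pipeline(sc) %>%
   ft_r_formula(Species ~ Sepal_Length + Sepal_Width + Petal_Length + Petal_Width) %>%
   ml_logistic_regression()

# Fit the pipeline, then score the same data
modelFit <- ml_fit(modelPipeline, iris_sc)

predictions <- ml_predict(modelFit, iris_sc)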

Created on 2018-10-30 by the reprex package (v0.2.1)

This produces a Spark DataFrame with a column named probability, whose Spark data type is org.apache.spark.ml.linalg.VectorUDT. Each row holds three elements: the model's predicted probabilities for the three possible classes.

How can I get one of these values out of this object using sparklyr? Something like probability[1] doesn't work in sparklyr, and I haven't found a function in dplyr or sparklyr that would do it.

Dave Kincaid
  • This link might have some useful comments. https://stackoverflow.com/questions/43589762/sparklyr-how-to-explode-a-list-column-into-their-own-columns-in-spark-table – strawberryBeef Oct 31 '18 at 14:33
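Following up on that pointer, here is a minimal sketch of one approach that may work, not a confirmed answer. It assumes that sparklyr's sdf_separate_column() can split the probability vector column into one numeric column per element. The helper name split_probability and the output column names prob_class_0, prob_class_1 and prob_class_2 are hypothetical, chosen only for illustration. The element order is assumed to follow the label index that the formula transformer assigned to Species, which is worth checking against the fitted model.

```r
library(sparklyr)
library(dplyr)

# Hypothetical helper (not part of sparklyr): splits the vector column
# `probability` into one column per class. The names in `into` are
# illustrative assumptions; element i of the vector goes to into[i].
split_probability <- function(predictions,
                              into = c("prob_class_0",
                                       "prob_class_1",
                                       "prob_class_2")) {
  sdf_separate_column(predictions, "probability", into = into)
}

# Usage with the `predictions` table from the question (needs a live
# Spark connection, so it is left commented out here):
# predictions %>%
#   split_probability() %>%
#   select(Species, prob_class_0, prob_class_1, prob_class_2)
```

The new columns are ordinary Spark columns, so dplyr verbs such as select() and filter() should work on them. Passing a named list to into instead (mapping a column name to a 1-based index) should extract a single element if only one class is of interest.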

0 Answers