I am working with the sparklyr package with data in Spark. I'm building a logistic regression model and having some trouble figuring out how to get the probabilities for each class out of the probability column of the predictions data frame produced by ml_predict() on my model.
Here is some short sample code that demonstrates what I'm doing:
library(sparklyr)
library(dplyr)
sc <- spark_connect(master = "local[1]", version = "2.3.2")
iris_sc <- copy_to(sc, iris)
modelPipeline <- ml_pipeline(sc) %>%
  ft_r_formula(Species ~ Sepal_Length + Sepal_Width + Petal_Length + Petal_Width) %>%
  ml_logistic_regression()
modelFit <- ml_fit(modelPipeline, iris_sc)
predictions <- ml_predict(modelFit, iris_sc)
Created on 2018-10-30 by the reprex package (v0.2.1)
This produces a Spark data frame with a column named probability, which has the Spark datatype org.apache.spark.ml.linalg.VectorUDT and holds three elements per row: the model's predicted probabilities for each of the three possible classes.
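For reference, the column type can be inspected with a sketch like the following. It assumes the predictions object from the code above and uses sparklyr's sdf_schema(), which returns a named list with one entry per column, each holding that column's name and type.

```r
library(sparklyr)

# Assumes `predictions` is the Spark data frame returned by ml_predict() above.
schema <- sdf_schema(predictions)

# Each entry is a list with `name` and `type`; look at the probability column.
schema$probability$type
```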
How can I get one of these values out of this object using sparklyr? Something like probability[1] of course doesn't work in sparklyr, and I'm not finding a function in dplyr or sparklyr that might be useful.