In the following code, I want to replace map_dfr from purrr with one of the SparkR apply functions so that the Shapley calculations run in parallel on Azure Databricks:

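#install.packages("randomForest"); install.packages("tidyverse"); install.packages("iml"); install.packages("SparkR")
# Attach SparkR first so that dplyr's mutate(), arrange(), select(), etc. are not masked by SparkR's versions
library(SparkR); library(tidyverse); library(iml); library(randomForest)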

mtcars1 <- mtcars %>%  mutate(vs = as.factor(vs), id = row_number())

x <- "vs"
y <- paste0(setdiff(setdiff(names(mtcars1), "vs"), "id"), collapse = "+")

rf = randomForest(as.formula(paste0(x, "~ ", y)), data = mtcars1, ntree = 50)

predictor <- Predictor$new(rf, data = mtcars1, y = mtcars1$vs)

# For each row, compute Shapley values and keep the five largest contributions
shapelyresults <- map_dfr(seq_len(nrow(mtcars1)), ~(Shapley$new(predictor, x.interest = mtcars1[.x,]) %>% 
                                              .$results %>% 
                                              as_tibble() %>% 
                                              arrange(desc(phi)) %>% 
                                              slice(1:5) %>% 
                                              select(feature.value, phi) %>%
                                              mutate(id = .x)))

I could not adapt the answer to the following question to my case: How to apply a function to each row in SparkR?

Alex Ott
Geet
  • That would be one of the apply functions in SparkR. See the docs here https://spark.apache.org/docs/latest/sparkr.html#applying-user-defined-function – ookboy24 Dec 05 '18 at 07:25
  • Can you please help me convert the above map_dfr into that dapply syntax? – Geet Dec 05 '18 at 08:12
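
One possible direction for the conversion requested above is spark.lapply, which runs an R function over the elements of a list on the cluster and returns a list, much as map_dfr does locally before row-binding. The following is an untested sketch, not a confirmed solution. It assumes that an active SparkR session exists (as on Databricks), that iml and randomForest are installed on every worker, that predictor and mtcars1 from the question are defined on the driver and get shipped to the workers with the function's closure, and that the R6 predictor object survives serialization. The helper name shapley_top5 is hypothetical.

```r
library(SparkR)

# Assumption: `predictor` and `mtcars1` were built on the driver as in the question.
# Hypothetical helper: top five Shapley contributions for row i, as a plain data.frame.
shapley_top5 <- function(i) {
  # Load packages inside the function so each worker process attaches them itself
  library(iml)
  library(randomForest)

  res <- Shapley$new(predictor, x.interest = mtcars1[i, ])$results
  # Sort by phi (largest first) and keep only the two columns of interest
  res <- res[order(res$phi, decreasing = TRUE), c("feature.value", "phi")]
  res <- head(res, 5)
  res$id <- i
  res
}

# One Spark task per row index; the result is an ordinary R list on the driver
results_list <- spark.lapply(seq_len(nrow(mtcars1)), shapley_top5)

# Row-bind the per-row data frames, mirroring what map_dfr does
shapelyresults <- do.call(rbind, results_list)
```

spark.lapply is used here rather than dapply because dapply operates on the partitions of a SparkDataFrame and requires an output schema, whereas the original map_dfr call simply iterates over row indices of a local data frame. The helper uses base R instead of dplyr verbs to avoid the name clashes between dplyr and SparkR.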

0 Answers