I am trying to run a function that contains a for loop in parallel, using sparklyr's spark_apply(), on Databricks on Azure.
My function is:
distribution <- function(sims){
  for (p in 1:100){
    # accumulators, reset on every pass of the outer loop
    increment_value <- list()
    profiles <- list()
    samples <- list()
    sample_num <- list()
    for (i in 1:length(samp_seq)){
      # draw a batch from sims and store its sum, keyed by the sum's value;
      # samp_seq and batch come from the enclosing session, not the function
      w <- sample(sims, size = batch)
      z <- sum(w)
      name3 <- as.character(z)
      samples[[name3]] <- data.frame(value = z)
    }
  }
}
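For context, samp_seq and batch are not created inside distribution(); they come from my driver session. Hypothetical stand-ins (my real values differ) look like this:

samp_seq <- 1:10 # controls how many draws the inner loop makes
batch    <- 100  # size of each sample() draw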
When I pass the function to spark_apply() like this:
sdf_len(sc, 1) %>%
  spark_apply(distribution)
I get the following error:
Error : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 305.0 failed 4 times, most recent failure: Lost task 0.3 in stage 305.0 (TID 297, 10.139.64.6, executor 0): java.lang.Exception: sparklyr worker rscript failure with status 255, check worker logs for details.
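For contrast, this is the shape of closure that spark_apply documents: no free variables, and a data frame as the return value (a minimal sketch, assuming the same connection sc):

sdf_len(sc, 1) %>%
  spark_apply(function(df) {
    # no free variables: everything used here is local to the closure
    data.frame(value = sum(rnorm(100)))
  })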