
I am trying to run a function containing a for loop in parallel using spark_apply in Databricks on Azure.

My function is:

distribution <- function(sims) {
  for (p in 1:100) {
    increment_value <- list()
    profiles <- list()
    samples <- list()
    sample_num <- list()
    # samp_seq and batch are defined in my global environment on the driver,
    # not inside the function
    for (i in seq_along(samp_seq)) {
      w <- sample(sims, size = batch)
      z <- sum(w)
      name3 <- as.character(z)
      samples[[name3]] <- data.frame(value = z)
    }
  }
}
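
For reference, samp_seq and batch (and sims, when I test locally) are defined in my driver session; a minimal placeholder setup looks like this (the values below are illustrative, not my real data):

# Placeholder driver-side setup; real values come from my data
batch    <- 10                # draws taken per sample
samp_seq <- seq_len(50)       # one entry per sample to draw
sims     <- rnorm(1000)       # simulated values passed to distribution()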

When I pass the function to spark_apply like this:

sdf_len(sc,1) %>%
  spark_apply(distribution)

I get the following error:

Error : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 305.0 failed 4 times, most recent failure: Lost task 0.3 in stage 305.0 (TID 297, 10.139.64.6, executor 0): java.lang.Exception: sparklyr worker rscript failure with status 255, check worker logs for details.
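
For completeness, my understanding is that spark_apply() can ship driver-side objects to the workers through its context argument, so I also sketched a self-contained variant along those lines (sample_batches is my own name and the body is a simplification, not anything from the error):

# Sketch: pass driver-side objects to the workers explicitly via context.
# When a context is supplied, the worker function receives the partition
# first and the context list second.
sample_batches <- function(df, ctx) {
  samples <- numeric(length(ctx$samp_seq))
  for (i in seq_along(ctx$samp_seq)) {
    w <- sample(ctx$sims, size = ctx$batch)
    samples[i] <- sum(w)
  }
  data.frame(value = samples)   # spark_apply expects a data frame back
}

sdf_len(sc, 1) %>%
  spark_apply(sample_batches,
              context = list(sims = sims, batch = batch, samp_seq = samp_seq))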
  • Hi @Mike Woods-DeWitt, welcome to the community. It is usually good practice to share a reproducible example so that the people who want to help you can reproduce the issue. Could you please share the dataframe you are applying the function to? Here are some hints on how to do so: https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – Emer Apr 16 '21 at 18:46

0 Answers