
I am working with the R programming language.

Currently, I am trying to learn more about Parallel Computing and how to optimize running functions on large datasets, but I find I keep getting confused with the different options that are available. As an example, here are some of the options I have come across:

  • doParallel
  • foreach
  • future
  • doSNOW
  • makePSOCKcluster
  • clusterEvalQ
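From what I have read, several of these are meant to be used together rather than as alternatives. For example, I believe a typical setup combines makePSOCKcluster()/makeCluster(), registerDoParallel() and foreach() roughly like this (a minimal sketch of my understanding, not code I have run on my real data):

library(foreach)
library(doParallel)

# Create an explicit PSOCK cluster with 2 workers
cl <- parallel::makePSOCKcluster(2)

# Tell foreach's %dopar% to use this cluster
registerDoParallel(cl)

# Each iteration runs on a worker; .combine = c collects the results into a vector
squares <- foreach(i = 1:4, .combine = c) %dopar% i^2

parallel::stopCluster(cl)

squares
# 1 4 9 16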

To give some context to my problem - here is the dataset I am working with (the real dataset has over 10 million rows):

id = sample.int(10000, 100000, replace = TRUE)
res = c(1,0)
results = sample(res, 100000, replace = TRUE)
date_exam_taken = sample(seq(as.Date('1999/01/01'), as.Date('2020/01/01'), by="day"), 100000, replace = TRUE)


my_data = data.frame(id, results, date_exam_taken)
my_data <- my_data[order(my_data$id, my_data$date_exam_taken),]

my_data$general_id = 1:nrow(my_data)
my_data$exam_number = ave(my_data$general_id, my_data$id, FUN = seq_along)
my_data$general_id = NULL

      id results date_exam_taken exam_number
42599  1       1      2000-11-11           1
56091  1       1      2001-01-04           2
26039  1       0      2001-06-24           3
84767  1       1      2001-10-19           4
20920  1       1      2004-10-12           5
20653  1       1      2006-04-04           6

And here is the function I am currently running on this data (calculating conditional probabilities):

my_list = list()

for (i in unique(my_data$id)) {
    tryCatch({
        start_i = my_data[my_data$id == i, ]

        pairs_i = data.frame(first = head(start_i$results, -1), second = tail(start_i$results, -1))
        frame_i = as.data.frame(table(pairs_i))
        frame_i$id = i
        print(frame_i)
        my_list[[i]] = frame_i
    }, error = function(e) {})
}

final = do.call(rbind.data.frame, my_list)
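For what it's worth, before parallelizing I also tried rewriting the loop with split(), which avoids rescanning my_data for every id. I believe this produces the same per-id pair counts (note the id column comes out as character here rather than numeric, and I have not benchmarked it):

# Split the results column into one vector per id
# (my_data is already sorted by id and date, so order within each group is preserved)
by_id <- split(my_data$results, my_data$id)

pair_list <- lapply(names(by_id), function(id) {
  res <- by_id[[id]]
  if (length(res) < 2) return(NULL)  # ids with a single exam have no pairs
  frame_i <- as.data.frame(table(first = head(res, -1), second = tail(res, -1)))
  frame_i$id <- id
  frame_i
})

final2 <- do.call(rbind.data.frame, pair_list)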

So far, I have learned how to run this same code using two different Parallel Computing based options in R:

Option 1: doParallel and foreach

# Load required libraries
library(foreach)
library(doParallel)

# Set up a parallel cluster with 4 cores
cl <- makeCluster(4)
registerDoParallel(cl)

# Define a function that takes a single id and computes the desired output
my_function <- function(i) {
  tryCatch({
    start_i = my_data[my_data$id == i, ]

    pairs_i = data.frame(first = head(start_i$results, -1), second = tail(start_i$results, -1))
    frame_i = as.data.frame(table(pairs_i))
    frame_i$id = i

    return(frame_i)
  }, error = function(e) {})
}

# Apply "my_function" in parallel to each id
# (.export copies my_data to the workers, which PSOCK clusters need)
my_list <- foreach(i = unique(my_data$id), .combine = rbind, .export = "my_data") %dopar% my_function(i)

# Stop the parallel cluster
stopCluster(cl)
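One small thing I was unsure about: I picked 4 cores arbitrarily. I believe parallel::detectCores() reports the number of available cores, so the cluster size could be chosen from that instead of hard-coding it (leaving one core free for the OS, which I have seen recommended):

library(parallel)

# detectCores() can return NA on some platforms, so fall back to 1
n_cores <- max(1, detectCores() - 1, na.rm = TRUE)
cl <- makeCluster(n_cores)
# ... run the foreach code from above ...
stopCluster(cl)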

Option 2: clusterEvalQ

# Load the parallel package and define the cluster of workers
library(parallel)
cluster = makeCluster(4)

# Export the my_data object to the cluster
clusterExport(cluster, "my_data")

# Use clusterEvalQ to evaluate the code on the cluster
clusterEvalQ(cluster, {
  my_list = list()

  for (i in unique(my_data$id)) {
    tryCatch({
      start_i = my_data[my_data$id == i, ]

      pairs_i = data.frame(first = head(start_i$results, -1), second = tail(start_i$results, -1))
      frame_i = as.data.frame(table(pairs_i))
      frame_i$id = i
      print(frame_i)
      my_list[[i]] = frame_i
    }, error = function(e) {})
  }

  final = do.call(rbind.data.frame, my_list)
})

# Stop the parallel cluster
stopCluster(cluster)

Can someone please comment on the code I have written? Are these two methods essentially performing the same task at a similar level of efficiency? In general, are there any other methods I could employ to increase the efficiency of this code?

Thanks!

stats_noob
    You are diving into the wrong rabbit hole. The speed-up possible with parallelization is smaller than the number of CPUs. Improving your function (using the packages data.table and/or Rcpp) can speed this up by orders of magnitude. At least avoid `print` and data.frames if performance is important. – Roland Dec 13 '22 at 06:50
  • For a quick intro to the difference between sockets and forking see [here](https://dept.stat.lsa.umich.edu/~jerrick/courses/stat701/notes/parallel.html). – Rui Barradas Dec 13 '22 at 07:53
  • I agree with Roland. In my opinion, when searching for performance gains, the absolutely mandatory first step is to define a benchmark for comparison (so you can actually judge whether you are making the code faster; you might be writing slower code). Become familiar with the appropriate tools and packages for this: I used to like microbenchmark, but now I prefer bench; the profvis package is also useful for profiling one's code. Good luck and happy learning. – Nir Graham Dec 16 '22 at 11:39

0 Answers