I am working with the R programming language.
Currently, I am trying to learn more about parallel computing and how to optimize running functions on large datasets, but I keep getting confused by the different options that are available. As an example, here are some of the options I have come across:
- doParallel
- foreach
- future
- doSNOW
- makePSOCKcluster
- clusterEvalQ
To give some context to my problem - here is the dataset I am working with (the real dataset has over 10 million rows):
# Simulate the data (the real dataset has over 10 million rows)
id = sample.int(10000, 100000, replace = TRUE)
results = sample(c(1, 0), 100000, replace = TRUE)
date_exam_taken = sample(seq(as.Date('1999/01/01'), as.Date('2020/01/01'), by = "day"), 100000, replace = TRUE)
my_data = data.frame(id, results, date_exam_taken)

# Sort by student and exam date, then number each student's exams chronologically
my_data <- my_data[order(my_data$id, my_data$date_exam_taken), ]
my_data$general_id = 1:nrow(my_data)
my_data$exam_number = ave(my_data$general_id, my_data$id, FUN = seq_along)
my_data$general_id = NULL
id results date_exam_taken exam_number
42599 1 1 2000-11-11 1
56091 1 1 2001-01-04 2
26039 1 0 2001-06-24 3
84767 1 1 2001-10-19 4
20920 1 1 2004-10-12 5
20653 1 1 2006-04-04 6
And here is the loop I am currently running on this data (counting consecutive-result pairs per student, from which I calculate conditional probabilities):
my_list = list()
for (i in 1:length(unique(my_data$id))) {
  tryCatch({
    # All exams for student i, already in date order
    start_i = my_data[my_data$id == i, ]
    # Pair each result with the result that follows it
    pairs_i = data.frame(first = head(start_i$results, -1), second = tail(start_i$results, -1))
    frame_i = as.data.frame(table(pairs_i))
    frame_i$id = i
    print(frame_i)
    my_list[[i]] = frame_i
  }, error = function(e) {})
}
final = do.call(rbind.data.frame, my_list)
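For what it's worth, while reading about this I also sketched a version of the same computation that avoids subsetting the whole data frame once per id, using split() and Map() (this assumes my_data is already sorted by id and date, as constructed above):

```r
# Sketch: same per-student transition counts via split()/Map() instead of the id loop.
# Assumes my_data is sorted by id and date_exam_taken, as constructed above.
by_id <- split(my_data$results, my_data$id)

count_pairs <- function(res, id) {
  if (length(res) < 2) return(NULL)  # a student with one exam has no consecutive pairs
  pairs <- data.frame(first = head(res, -1), second = tail(res, -1))
  out <- as.data.frame(table(pairs))
  out$id <- id
  out
}

my_list <- Map(count_pairs, by_id, names(by_id))
final <- do.call(rbind, my_list)
```

My understanding is that split() does the grouping in one pass, instead of scanning the full data frame for every id.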
So far, I have learned how to run this same code using two different parallel-computing options in R:
Option 1: doParallel and foreach
# Load required libraries
library(foreach)
library(doParallel)
# Set up a parallel cluster with 4 cores
cl <- makeCluster(4)
registerDoParallel(cl)
# Define a function that computes the desired output for a single value of "i".
# The data is passed as an argument so that foreach ships it to the workers
# (each worker is a separate R session and cannot see the master's objects).
my_function <- function(i, dat) {
  tryCatch({
    start_i = dat[dat$id == i, ]
    pairs_i = data.frame(first = head(start_i$results, -1), second = tail(start_i$results, -1))
    frame_i = as.data.frame(table(pairs_i))
    frame_i$id = i
    frame_i
  }, error = function(e) {})
}

# Use the foreach function to apply "my_function" in parallel to each value of "i"
my_list <- foreach(i = 1:length(unique(my_data$id)), .combine = rbind) %dopar% my_function(i, my_data)
# Stop the parallel cluster
stopCluster(cl)
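As a variant of Option 1, my understanding is that the same work can also be written with parLapply() from the base parallel package, with no foreach at all (a sketch; I have not benchmarked it):

```r
library(parallel)

cl <- makeCluster(4)
# Workers are fresh R sessions, so the data must be shipped to them explicitly
clusterExport(cl, "my_data")

# One list element per student, computed across the 4 workers
my_list <- parLapply(cl, unique(my_data$id), function(i) {
  start_i <- my_data[my_data$id == i, ]
  pairs_i <- data.frame(first = head(start_i$results, -1),
                        second = tail(start_i$results, -1))
  frame_i <- as.data.frame(table(pairs_i))
  frame_i$id <- i
  frame_i
})

stopCluster(cl)
final <- do.call(rbind.data.frame, my_list)
```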
Option 2: clusterEvalQ
# Define the cluster of workers
library(parallel)
cluster = makeCluster(4)
# Export the my_data object to each worker's session
clusterExport(cluster, "my_data")
# Use clusterEvalQ to evaluate the code on the cluster
# (note: clusterEvalQ evaluates this whole block on every worker)
clusterEvalQ(cluster, {
  my_list = list()
  for (i in 1:length(unique(my_data$id))) {
    tryCatch({
      start_i = my_data[my_data$id == i, ]
      pairs_i = data.frame(first = head(start_i$results, -1), second = tail(start_i$results, -1))
      frame_i = as.data.frame(table(pairs_i))
      frame_i$id = i
      print(frame_i)
      my_list[[i]] = frame_i
    }, error = function(e) {})
  }
  final = do.call(rbind.data.frame, my_list)
})
# Stop the parallel cluster
stopCluster(cluster)
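I also see future in my list above; from the documentation I have read, the equivalent using the future.apply package (assuming it is installed) would look roughly like this:

```r
library(future.apply)  # assumes the future.apply package is installed

plan(multisession, workers = 4)  # 4 background R sessions

# future_lapply identifies and exports the globals (here, my_data) automatically
my_list <- future_lapply(unique(my_data$id), function(i) {
  start_i <- my_data[my_data$id == i, ]
  pairs_i <- data.frame(first = head(start_i$results, -1),
                        second = tail(start_i$results, -1))
  frame_i <- as.data.frame(table(pairs_i))
  frame_i$id <- i
  frame_i
})

plan(sequential)  # shut the background sessions down
final <- do.call(rbind.data.frame, my_list)
```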
Can someone please comment on the code I have written? Are these two methods essentially performing the same task, and at a similar level of efficiency? And in general, are there any other methods I could employ to make this code more efficient?
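For completeness, one non-parallel alternative I came across is data.table, which I understand performs grouped operations in optimized C code (a sketch, assuming the data.table package is installed):

```r
library(data.table)  # assumes the data.table package is installed

dt <- as.data.table(my_data)

# Build each student's consecutive (first, second) result pairs,
# then count how often each combination occurs per student
final_dt <- dt[, .(first = head(results, -1), second = tail(results, -1)), by = id][
  , .(Freq = .N), by = .(id, first, second)]
```

One difference from the table() version: this only returns combinations that actually occur, so zero-count rows are absent.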
Thanks!