I'm trying to speed up a loop by replacing a plain 'for loop' with a parallel 'foreach loop'. I have around 1000 small data frames, each with a different number of rows and columns.
My goal is to row-bind all of those data frames into one table using bind_rows from the dplyr package. I have done some research online on the basics of setting up 'foreach' and 'doParallel', such as "Run a for loop in parallel in R", "Parallel R Loops for Windows and Linux", and "R - parallel computing in 5 minutes (with foreach and doParallel)".
Below are more details about my environment (data preparation).
Sample small data frames. Note: the small data frames can have different numbers of rows and columns.
RYW0001_rs <- data.frame(
"A" = c("Coff", "Apple", "Coff", "Milk", "Milk", "Coff"),
"B" = c("ToothB", "Apple", "Orange", NA, "Pear", "Grape"),
"C" = c("ToothP", "ToothP", NA, NA, "ToothB", "Yam"),
"D" = c(NA, "Potato", NA, NA, NA, NA)
)
RYW0002_rs <- data.frame(
"A" = c("Coff", "Apple", "Coff", "Milk", "Milk", "Coff"),
"B" = c(NA, "Potato", NA, NA, NA, NA)
)
RYW0003_rs <- data.frame(
"A" = c("Coff", "Apple", "Coff", "Milk", "Milk", "Coff"),
"B" = c("ToothB", "Apple", "Orange", NA, "Pear", "Grape"),
"C" = c("Apple", "ToothP", "Orange", NA, "Milk", "Grape"),
"D" = c("ToothP", "Orange", NA, NA, "Pear", "Yam"),
"E" = c("ToothP", "ToothP", NA, NA, "ToothB", "Yam"),
"F" = c(NA, "Potato", NA, NA, NA, NA)
)
Data frame names stored as a character vector (to be used like a macro variable)
Merchant_No_rs1 <- c('RYW0001_rs','RYW0002_rs','RYW0003_rs')
Coding 1: Previous for loop [works fine; it produces warning messages like the ones below, but they do not affect my expected result]
Warning messages:
1: In bind_rows_(x, .id) : Unequal factor levels: coercing to character
2: In bind_rows_(x, .id) : Unequal factor levels: coercing to character
3: In bind_rows_(x, .id) : Unequal factor levels: coercing to character
Step 1: Create an empty temp object
temp <- NULL
Step 2: for loop
for (j in seq_along(Merchant_No_rs1)) {
  # look up the data frame by its name and append its rows to temp
  temp <- bind_rows(temp, get(Merchant_No_rs1[[j]]))
  print(dim(temp))
}
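As a side note on Coding 1, the same result can probably be reached without growing temp inside a loop. This is a minimal sketch, assuming the data frames are visible from where the code runs; the shortened data frames are hypothetical stand-ins for the samples above, and temp_list is just an illustrative name:

```r
library(dplyr)

# Hypothetical, shortened stand-ins for the sample data frames
RYW0001_rs <- data.frame(A = c("Coff", "Apple"), B = c("ToothB", NA),
                         stringsAsFactors = FALSE)
RYW0002_rs <- data.frame(A = c("Milk", "Coff"), stringsAsFactors = FALSE)
Merchant_No_rs1 <- c("RYW0001_rs", "RYW0002_rs")

# mget() looks up all names at once and returns a named list of data frames
frames <- mget(Merchant_No_rs1, inherits = TRUE)

# bind_rows() accepts a list; columns missing from a data frame are filled with NA
temp_list <- bind_rows(frames)
print(dim(temp_list))
```

Because bind_rows() takes the whole list in one call, temp is not copied again on every iteration.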
Coding 2: Current foreach loop [doesn't work; I get the error below]
Error in { : task 1 failed - "object 'RYW0001_rs' not found"
Step 1: Create an empty temp object
temp <- NULL
Step 2: foreach loop
foreach (j=1:length(Merchant_No_rs1), .packages=c("dplyr"), .export=sprintf("%s",Merchant_No_rs1[[j]])) %dopar% {
temp <- bind_rows(temp, get(Merchant_No_rs1[[j]]))
}
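One possible explanation for the error, offered as an assumption rather than a confirmed diagnosis: .export is an argument of foreach() and is evaluated once, before any task runs, so Merchant_No_rs1[[j]] cannot name a different data frame for each task. Separately, assigning to temp inside %dopar% happens on the workers and does not update temp in the main session. A hedged sketch that exports every name and lets .combine do the binding is below; the two-worker cluster and the shortened stand-in data frames are assumptions for illustration only:

```r
library(foreach)
library(doParallel)
library(dplyr)

# Hypothetical, shortened stand-ins for the sample data frames
RYW0001_rs <- data.frame(A = c("Coff", "Apple"), B = c("ToothB", NA),
                         stringsAsFactors = FALSE)
RYW0002_rs <- data.frame(A = c("Milk", "Coff"), stringsAsFactors = FALSE)
RYW0003_rs <- data.frame(A = c("Coff", "Milk"), B = c("Pear", "Grape"),
                         C = c(NA, "Potato"), stringsAsFactors = FALSE)
Merchant_No_rs1 <- c("RYW0001_rs", "RYW0002_rs", "RYW0003_rs")

# Assumed setup: a local cluster with two workers
cl <- makeCluster(2)
registerDoParallel(cl)

temp <- foreach(
  j = seq_along(Merchant_No_rs1),
  .packages = "dplyr",
  .export = Merchant_No_rs1,  # export all data frames by name, not a single one
  .combine = bind_rows        # row-bind the per-task results in the main session
) %dopar% {
  # each task only returns its data frame; no shared temp is modified
  get(Merchant_No_rs1[[j]])
}

stopCluster(cl)
print(dim(temp))
```

With data frames this small, the cost of shipping them to the workers may outweigh any gain, so it is worth timing this against the plain loop.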
My expected result is the same as the output of Coding 1: even though the small data frames have different rows and columns, any new column gets appended to the temp table. Below is the outcome table (temp).
Question: Is there a way to do this in parallel with a 'foreach loop' while getting the same result as the 'for loop'?
Any help would be appreciated :) Thanks