
I'm trying to optimize loop performance by replacing a 'for loop' with a 'foreach' parallel-processing loop. I have around 1000 small dataframes with different numbers of rows and columns.

My goal is to row-bind all of those dataframes into a single combined table using the dplyr package's bind_rows. I have done some research online on the basics of setting up a 'foreach loop' and 'doParallel', such as: run a for loop in parallel in R; Parallel R Loops for Windows and Linux; R - parallel computing in 5 minutes (with foreach and doParallel).

Below are more details about my environment (data preparation).

Sample small dataframes. Note: these small dataframes may all have different rows and columns.

RYW0001_rs <- data.frame(
  "A" = c("Coff", "Apple", "Coff", "Milk", "Milk", "Coff"), 
  "B" = c("ToothB", "Apple", "Orange", NA, "Pear", "Grape"),
  "C" = c("ToothP", "ToothP", NA, NA, "ToothB", "Yam"), 
  "D" = c(NA, "Potato", NA, NA, NA, NA)
)

RYW0002_rs <- data.frame(
  "A" = c("Coff", "Apple", "Coff", "Milk", "Milk", "Coff"), 
  "B" = c(NA, "Potato", NA, NA, NA, NA)
)

RYW0003_rs <- data.frame(
  "A" = c("Coff", "Apple", "Coff", "Milk", "Milk", "Coff"), 
  "B" = c("ToothB", "Apple", "Orange", NA, "Pear", "Grape"),
  "C" = c("Apple", "ToothP", "Orange", NA, "Milk", "Grape"),
  "D" = c("ToothP", "Orange", NA, NA, "Pear", "Yam"), 
  "E" = c("ToothP", "ToothP", NA, NA, "ToothB", "Yam"), 
  "F" = c(NA, "Potato", NA, NA, NA, NA)
)

Dataframe names stored as character strings (to be used as macro variables)

Merchant_No_rs1 <- c('RYW0001_rs','RYW0002_rs','RYW0003_rs')

Coding 1: Previous for loop [works fine; there are some warning messages like the ones below, but they don't affect my expected result]

Warning messages:
1: In bind_rows_(x, .id) : Unequal factor levels: coercing to character
2: In bind_rows_(x, .id) : Unequal factor levels: coercing to character
3: In bind_rows_(x, .id) : Unequal factor levels: coercing to character

Step 1: Create an EMPTY new temp object

temp <- NULL

Step 2: for loop

library(dplyr)  # provides bind_rows

for (j in seq_along(Merchant_No_rs1)) {
  # look up each dataframe by name and append it to temp
  temp <- bind_rows(temp, get(Merchant_No_rs1[[j]]))
  print(dim(temp))
}

Coding 2: Current foreach loop [doesn't work; I encounter the error below]

Error in { : task 1 failed - "object 'RYW0001_rs' not found"

Step 1: Create an EMPTY new temp object

temp <- NULL

Step 2: foreach loop

foreach (j=1:length(Merchant_No_rs1), .packages=c("dplyr"), .export=sprintf("%s",Merchant_No_rs1[[j]])) %dopar% {  
temp <- bind_rows(temp, get(Merchant_No_rs1[[j]]))
} 

My expected result is the same as the outcome of coding 1: although the small dataframes all have different rows and columns, a new column is appended to the temp table whenever one appears. The outcome table is temp.

Question: Is there any way to do parallel processing using a 'foreach loop' while getting the same result as the 'for loop'?

Any help will be appreciated :) Thanks

yc.koong
  • Just try `bind_rows(mget(Merchant_No_rs1))` to avoid any loop. – nicola Dec 08 '17 at 06:31
  • A `foreach` loop is fundamentally different to a `for` loop. It's more similar to a `lapply` loop. Side effects don't work in parallel. – Roland Dec 08 '17 at 06:39
  • Hi Nicola, thanks for the reply. I have tried your suggestion using mget(), but I still got the same error. Error in { : task 1 failed - "value for 'RYW0001_rs' not found" – yc.koong Dec 08 '17 at 06:41
  • I may have seen @Roland's comment a dozen times. Time to search google.. – F. Privé Dec 08 '17 at 10:45
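
Putting the two comments above together, here is a minimal sketch of what they seem to suggest. It rests on a few assumptions that are not in the question: the toy dataframes are shortened stand-ins, `stringsAsFactors = FALSE` is added so the factor-level warnings do not appear, the cluster size of 2 is arbitrary, and `temp_seq` / `temp_par` are hypothetical names. The idea is that each `foreach` task returns a value instead of assigning to `temp`, `.combine` stitches the returned pieces together, and `.export` receives the whole vector of names up front, since `j` does not exist yet when the arguments of `foreach()` are evaluated.

```r
library(dplyr)
library(foreach)
library(doParallel)

# Shortened stand-ins for the question's dataframes (assumption: character
# columns rather than factors, hence stringsAsFactors = FALSE).
RYW0001_rs <- data.frame(A = c("Coff", "Apple"), B = c("ToothB", NA),
                         stringsAsFactors = FALSE)
RYW0002_rs <- data.frame(A = c("Milk", "Coff"), stringsAsFactors = FALSE)
RYW0003_rs <- data.frame(A = "Coff", B = "Pear", C = "Yam",
                         stringsAsFactors = FALSE)
Merchant_No_rs1 <- c("RYW0001_rs", "RYW0002_rs", "RYW0003_rs")

# Loop-free version from the first comment: fetch all objects by name,
# then bind them in one call.
temp_seq <- bind_rows(mget(Merchant_No_rs1))

# Parallel version: start a small local cluster (size chosen arbitrarily).
cl <- makeCluster(2)
registerDoParallel(cl)

# Each task only *returns* a dataframe; nothing is assigned inside the body.
# .export ships every named dataframe to the workers, and .combine binds
# the returned pieces on the master process.
temp_par <- foreach(nm = Merchant_No_rs1,
                    .combine = bind_rows,
                    .export = Merchant_No_rs1) %dopar% {
  get(nm)
}

stopCluster(cl)

print(dim(temp_seq))
print(dim(temp_par))
```

Since the workers here do nothing except hand a dataframe back, the parallel version is unlikely to be faster than the single `bind_rows(mget(...))` call; it would mainly pay off if each task also did some expensive per-dataframe work before returning.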
