Processing a list of dataframes in R

Question

I am trying to run a simulation in parallel.

iterations = 50000
sim = foreach(i=1:iterations) %dopar% sim(dataframe, ... )

Each item in the list sim is a dataframe with 40 columns and 100 rows. Each dataframe has an ID column. I want to determine the average score by ID over 50,000 simulations.

I tried the following, but it was quite slow. I think this is because it had to constantly regrow the dataframe:

results = do.call(rbind.data.frame, sim)
avg.scores = ddply(results, ~Player, summarise, mean = mean(score))

I also tried setting the attributes on the list to convert it to a dataframe in place (Most efficient list to data.frame method?), but ended up with way more than 25 columns and different column names.

I am not sure whether there is a way to calculate the averages by iterating over the lists or whether I need to create a dataframe or datatable first, and then calculate the averages.

Thank you for any suggestions!

Use `rbindlist` from the data.table package. — Metrics, Mar 13 '15 at 21:01

score 1 · Answer 1 · answered Mar 13 '15 at 21:09

If the IDs are scattered throughout the dataframes in the list, then yes, you need an extra step: either pull all rows with the same ID into their own tables, or combine everything into one table and group by ID.

You can speed things up a little by using data.table and the .combine argument of foreach. (NOTE: it is also bad practice to give your output the same name as a function, as with sim in the question.)

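library(doParallel)
library(data.table)
# Register a parallel backend; without one, %dopar% runs sequentially with a warning
registerDoParallel()
iter <- 5E4
# Each iteration builds one simulated table: 100 rows, an ID column plus 39 numeric columns
simulations <- foreach(i = 1:iter, .combine = rbind, .packages = "data.table") %dopar% {
  data.table(ID = sample(LETTERS, 100, replace = TRUE), matrix(runif(3900), ncol = 39))
}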
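
If the simulation results already exist as a list of dataframes, as in the question, a sketch along the lines of the rbindlist comment would bind them in a single pass instead of combining inside the loop. The toy list sim_list and its score column are made up for illustration and stand in for the real foreach output:

```r
library(data.table)

# Hypothetical stand-in for the list returned by foreach: three small dataframes
sim_list <- list(
  data.frame(ID = c("A", "B"), score = c(1, 2)),
  data.frame(ID = c("A", "B"), score = c(3, 4)),
  data.frame(ID = c("A", "B"), score = c(5, 6))
)

# Bind all list elements into one data.table in a single step
combined <- rbindlist(sim_list)

# Average score per ID across all simulations
avg_scores <- combined[, .(mean_score = mean(score)), by = ID]
print(avg_scores)
```

With the real data, sim_list would be replaced by the list of 50,000 dataframes, and score by whichever column holds the value to average.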

You can then take the mean of one or more columns by using .SD:

means.by.ID.and.column <- simulations[, lapply(.SD, mean), by = ID, .SDcols = 2:40]
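
As a quick check of what the .SD call does, here is a toy table where the per-ID means are easy to verify by hand. The column names x and y are assumptions for illustration only; selecting columns by name in .SDcols is an alternative to the positional 2:40 used above:

```r
library(data.table)

# Toy data: two IDs, two numeric columns
toy <- data.table(
  ID = c("A", "A", "B", "B"),
  x  = c(1, 3, 10, 20),
  y  = c(2, 4, 30, 50)
)

# .SD holds the columns listed in .SDcols for each ID group;
# lapply applies mean to each of those columns
toy_means <- toy[, lapply(.SD, mean), by = ID, .SDcols = c("x", "y")]
print(toy_means)
```

The result has one row per ID and one mean column per entry in .SDcols.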

@user515663 does this answer your question? If so, please accept it; if not, how can I improve it? — mlegge, Apr 13 '15 at 14:52
