
I have a list of data frames that all share the same structure (in this example the variables a, b and c). I want to compute the mean of each variable across the whole list.

# list of 10 random data frames
n <- 1e1
initSeed <- 1234
set.seed(initSeed)
(seedVec <- sample.int(n = 1e3, size = n, replace = FALSE))
lst <- lapply(1:n, function(i) {
  set.seed(seedVec[i])
  a <- rnorm(24, 1, .1)
  b <- rnorm(24, 2, .2)
  c <- rnorm(24, 3, .3)
  data.frame(a, b, c)
})

I tried feeding the list to dplyr with lst %>% summarize_all(mean), but summarize_all does not accept lists. The code below gives me the means of each data frame in the list, but not yet the means of the variables a, b and c across all data frames.

library(dplyr)  # provides %>% and summarize_all()
lapply(1:10, function(n){
  lst[n] %>%
    data.frame() %>%
    summarize_all(mean)
})

So I wanted to build a new data frame from the summarized outputs in order to summarize them again, but this fails: both my extended code and a related answer throw `Error in lst[[idx]] : subscript out of bounds`. Here is my attempt:

df1 <- as.data.frame(setNames(replicate(3,numeric(0), simplify = FALSE), 
                                 letters[1:3]))
lapply(1:10, function(n){
  lst[n] %>%
    data.frame() %>%
    summarize_all(mean) %>%
    rbind(df1, lst[n])
})

df1 %>% summarize_all(mean)

How could I get what I want?

jay.sf
  • What should the output look like? If you want to summarize across all values, you could stack the datasets via something like `bind_rows` and then use `summarize_all`. If you want to take the mean of each dataset and then take the mean of those means (if things aren't balanced), you could use `map_df` from *purrr* for the initial loop averaging within each dataset and then use `summarize_all` on the output. – aosmith Jun 23 '17 at 15:35
  • Thanks, that threw the same weird error again, but the answer from @andrew-gustar solved it. – jay.sf Jun 23 '17 at 15:48

1 Answer

You can do this with purrr (together with dplyr's summarize_all), which gives the column means of each data frame:

purrr::map_df(lst, function(df){summarize_all(df,mean)})

           a        b        c
1  0.9917488 1.995821 3.121970
2  1.0007174 2.029938 2.962271
3  0.9582000 2.007167 3.046708
4  0.9745993 1.938877 3.015066
5  1.0050672 1.932359 3.052645
6  1.0196390 2.034723 2.998995
7  0.9717243 1.914532 3.024200
8  0.9954225 1.991664 2.981958
9  1.0148424 1.975775 2.949854
10 1.0014377 2.023839 2.976223

Or in base R...

t(sapply(lst,colMeans))
              a        b        c
 [1,] 0.9917488 1.995821 3.121970
 [2,] 1.0007174 2.029938 2.962271
 [3,] 0.9582000 2.007167 3.046708
 [4,] 0.9745993 1.938877 3.015066
 [5,] 1.0050672 1.932359 3.052645
 [6,] 1.0196390 2.034723 2.998995
 [7,] 0.9717243 1.914532 3.024200
 [8,] 0.9954225 1.991664 2.981958
 [9,] 1.0148424 1.975775 2.949854
[10,] 1.0014377 2.023839 2.976223
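
To get from the per-data-frame means above to a single mean per variable across the whole list, something like the following base R sketch should work. The data are regenerated here with an arbitrary seed purely for illustration (so the numbers will not match the output above), and the names `per_df`, `mean_of_means` and `stacked_means` are made up for this example.

```r
# Illustrative list of 10 data frames with columns a, b, c (arbitrary seed)
set.seed(1)
lst <- lapply(1:10, function(i) {
  data.frame(a = rnorm(24, 1, 0.1),
             b = rnorm(24, 2, 0.2),
             c = rnorm(24, 3, 0.3))
})

# One row of column means per data frame
per_df <- t(sapply(lst, colMeans))

# Mean of the per-data-frame means
mean_of_means <- colMeans(per_df)

# Means over all rows after stacking the data frames
stacked_means <- colMeans(do.call(rbind, lst))

# The two agree here because every data frame has the same number of rows
all.equal(mean_of_means, stacked_means)
```

If the data frames had different numbers of rows, the two results would generally differ, and the stacked version would be the one that weights every observation equally.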
Andrew Gustar
  • I only needed to add `summarize_all(mean)` afterwards, and then this is exactly what I wanted. Great answer, thanks! – jay.sf Jun 23 '17 at 15:49
  • ...or `t(colMeans(df))`, respectively. – jay.sf Jun 23 '17 at 15:58
  • Or just `rowMeans(sapply(lst,colMeans))` if you don't need the intermediate results. – Andrew Gustar Jun 23 '17 at 16:08
  • Would you also use `rowMeans(sapply(lst, colSd))` to get the overall SD? (By using [this](https://stackoverflow.com/a/17549829/6574038) nice formula.) – jay.sf Jun 24 '17 at 12:35
  • No, SD doesn't work like that, partly because it is not linear (being the square root of the mean squared deviation), and partly because each of the components of that average would be based on the mean of each subsample, rather than the overall mean. If you want the overall sd by column, you need to bind the dfs together then take the sd - something like `sapply(do.call(rbind,lst),sd)` (or replace `sd` with `mean` in this to get the same overall averages as above). – Andrew Gustar Jun 24 '17 at 17:26
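
The point about SDs can be checked with a small base R sketch. The data below are hypothetical and deliberately exaggerated: each data frame gets a different mean (an assumption made only so the effect is easy to see), and the names `mean_of_sds` and `overall_sd` are made up for this example.

```r
# Hypothetical data: the column means shift with i, so the subsamples
# are centred in different places
set.seed(1)
lst <- lapply(1:10, function(i) {
  data.frame(a = rnorm(24, 1 * i, 0.1),
             b = rnorm(24, 2 * i, 0.2),
             c = rnorm(24, 3 * i, 0.3))
})

# Average of the per-data-frame SDs: each SD is taken around its own
# subsample mean, so the spread between data frames is ignored
mean_of_sds <- rowMeans(sapply(lst, function(df) sapply(df, sd)))

# Overall SD per column: stack the rows first, then take the SD
overall_sd <- sapply(do.call(rbind, lst), sd)

# Compare the two side by side
rbind(mean_of_sds, overall_sd)
```

With data like these the overall SD comes out much larger than the averaged SDs, because it also captures the differences between the data frames. With the original list, where every data frame is drawn from the same distribution, the gap would be far smaller but the two quantities are still not the same thing.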