Aggregate across a list of dataframes

Question

I have a fitting process that runs 100 times. Each time the output is a dataframe -- which I capture by using lapply to create a list of 100 dataframes.

The first two dataframes might look something like this (I have more than 1 column):

n1 = c(4, 5, 6) 
df1 = data.frame(n1, row.names = c("height", "weight", "favcolor"))

n2 = c(2, 3, 5, 7) 
df2 = data.frame(n2, row.names = c("height", "weight", "inseam", "favcolor"))

I would like to combine these dataframes (take the average height value, or the standard deviation of the weights, for example).

My first thought was to turn this list into a dataframe -- not going to work because arguments imply differing number of rows.

My second thought was to insert NAs for each rowname that didn't appear (so, I would be adding:

new_row <- c(NA, NA)
row.names(new_row) <- "inseam"

But I can't get that to work for a myrid of reasons, and I'm unable to even test if I add an "inseam" row that contains NAs to the end of df1, that when I take the average of the "inseam" I get the correct answer.

The correct average in this case being:

height   3
weight   4
inseam   5
favcolor 6.5

So, my question:

If you had a list of dataframes, where the row names are meaningful and need to be aggregated across like the above example, what is the best way to do that?

################

EDIT

Here is the full data from 3 of my datasets:

> mega_df[1]
[[1]]
                                 coeff     error         pval
rf2              -1.15099200 0.5752430 4.540538e-02
rf3              -0.53430218 0.4928974 2.783635e-01
rf4               0.08784138 0.4933079 8.586711e-01
rf5               0.96002601 0.5070957 5.833327e-02
rm2             -0.36188368 0.4626464 4.340949e-01
rm3              0.01805873 0.4355164 9.669251e-01
rm4              0.45008373 0.4319557 2.974268e-01
rm5              1.04056503 0.4441024 1.912556e-02
rc_cat1     0.86231928 0.2827566 2.290799e-03
rc_cat2   1.21335473 0.2448206 7.192321e-07
rc_cat3     0.96196637 0.2044198 2.528247e-06
rc_cat4     1.04477290 0.3302644 1.559142e-03
rc_cat5   -0.58902623 1.5893867 7.109357e-01
rc_cat6  0.62569607 0.1720676 2.765407e-04
rc_cat7     0.29381724 0.4115594 4.752815e-01
rs2              0.12333678 0.7186019 8.637250e-01
rs3              1.22018613 0.6423970 5.750837e-02
rs4              1.96075220 0.6454184 2.381892e-03
rs5              2.58404946 0.6543862 7.853927e-05
1|3                         0.01561497 0.4851330 9.743229e-01
3|4                         1.82853786 0.4937675 2.128663e-04
4|5                         3.73480100 0.5023435 1.047616e-13

> mega_df[2]
[[1]]
                                 coeff        error         pval
rf2              -0.23364248 5.849338e-01 6.895734e-01
rf3               0.24054894 5.219730e-01 6.449094e-01
rf4               0.84072979 5.208259e-01 1.064788e-01
rf5               1.47867154 5.346970e-01 5.684640e-03
rm2             -0.29555400 4.465509e-01 5.080612e-01
rm3              0.31147504 4.131024e-01 4.508553e-01
rm4              0.73696523 4.141224e-01 7.514424e-02
rm5              1.14273148 4.271863e-01 7.472508e-03
rc_cat1     1.27479299 3.094432e-01 3.794740e-05
rc_cat2   1.10917318 2.619011e-01 2.284654e-05
rc_cat3     0.65782540 2.161602e-01 2.340525e-03
rc_cat4     0.40512225 3.301662e-01 2.198131e-01
rc_cat5   12.78797722 5.612311e-08 0.000000e+00
rc_cat6  0.41622889 1.677804e-01 1.310894e-02
rc_cat7     0.16833629 3.806498e-01 6.583198e-01
rs2             -0.02279305 7.225878e-01 9.748360e-01
rs3              0.68299485 6.759050e-01 3.122608e-01
rs4              1.36149302 6.780722e-01 4.465519e-02
rs5              2.18484594 6.863696e-01 1.456612e-03
1|3                         0.35419237 5.844931e-01 5.445266e-01
3|4                         2.12603072 5.928308e-01 3.354863e-04
4|5                         3.97564508 5.999369e-01 3.431064e-11

> mega_df[3]
[[1]]
                                coeff     error         pval
rf2              -0.2733408 0.5884741 6.422961e-01
rf3               0.1764257 0.5257697 7.372050e-01
rf4               0.6504428 0.5248386 2.152271e-01
rf5               1.3967757 0.5356706 9.119879e-03
rm2             -0.2361284 0.4870015 6.277745e-01
rm3              0.2078729 0.4609270 6.519977e-01
rm4              0.6390950 0.4622065 1.667555e-01
rm5              1.1738653 0.4730686 1.308730e-02
rc_cat1     0.9337627 0.2958630 1.599133e-03
rc_cat2   1.0292916 0.2493133 3.651281e-05
rc_cat3     0.7088285 0.2012026 4.267587e-04
rc_cat4     0.6296966 0.3664883 8.576193e-02
rc_cat6  0.5475018 0.1720841 1.464662e-03
rc_cat7     0.4521113 0.3588440 2.077017e-01
rs2             -0.4663666 0.7031265 5.071541e-01
rs3              0.7810059 0.6489673 2.287985e-01
rs4              1.5178641 0.6522175 1.995271e-02
rs5              2.1916080 0.6578769 8.643075e-04
1|3                         0.2569225 0.4659460 5.813597e-01
3|4                         2.0648302 0.4769118 1.493906e-05
4|5                         3.9312070 0.4855339 5.648509e-16

And I'm hoping to do some basic aggregations that end up returning:

         avcoef   averror   avpval  std(coef)
rf2      W          X         Y              Z       
rf3   ...
.
.
.

score 3 · Accepted Answer · edited May 23 '17 at 12:07

3

I guess you could just create a new column in each data set which will contain the row names and then merge accordingly, something like:

l <- lapply(list(df1, df2), function(x) {x$New <- row.names(x) ; x})
Res <- Reduce(function(...) merge(..., by = "New", all = TRUE), l)
cbind(Res[1], Means = rowMeans(Res[-1], na.rm = TRUE))
#   Row.names Means
# 1  favcolor   6.5
# 2    height   3.0
# 3    inseam   5.0
# 4    weight   4.0

This is probably highly related to this

Edit: For the new data set

l <- lapply(list(mega_df1, mega_df2, mega_df3), function(x) {x$RowName <- row.names(x) ; x})
Res <- Reduce(function(...) merge(..., by = "RowName", all = TRUE), l)

library(data.table) ## v1.9.6+
dcast(melt(setDT(Res), "RowName"), 
      RowName ~ sub("\\..*", "", variable), 
      mean, 
      na.rm = TRUE, 
      value.var = "value")

#    RowName     coeff     error         pval
# 1:    cat1 1.0236250 0.2960209 1.309293e-03
# 2:    cat2 1.1172732 0.2520117 2.002619e-05
# 3:    cat3 0.7762068 0.2072609 9.232706e-04
# 4:    cat4 0.6931972 0.3423063 1.023781e-01
# 5:    cat5 6.0994755 0.7946934 3.554678e-01
# 6:    cat6 0.5298089 0.1706440 4.950048e-03
# 7:    cat7 0.3047549 0.3836844 4.471010e-01

edited May 23 '17 at 12:07

Community

1
1

answered Nov 06 '15 at 07:18

David Arenburg

91,361
17
137
196

So close... OK, it works perfectly if I give it `list(df1, df2)`, but if I give it `list(df1, df2, df3)` I get the following error: ```> Res <- Reduce(function(...) merge(..., by = "row.names", all=TRUE), list(mega_df[1], mega_df[3], mega_df[4])) Warning message: In merge.data.frame(..., by = "row.names", all = TRUE) : column name ‘Row.names’ is duplicated in the result > cbind(Res[1], Means = rowMeans(Res[-1], na.rm = TRUE)) Error in rowMeans(Res[-1], na.rm = TRUE) : 'x' must be numeric ``` – JBWhitmore Nov 06 '15 at 07:31
So it works for list(mega_df[1], mega_df[2]); and list(mega_df[1], mega_df[3]); but not for list(mega_df[1], mega_df[2], mega_df[3]) (the above error). Pasting in a piece of the dataframes in the next comment. – JBWhitmore Nov 06 '15 at 07:43
Ok, you were right. Merging just by row names won't work. See my edit. – David Arenburg Nov 06 '15 at 07:43
> l <- lapply(list(mega_df[1], mega_df[2]), function(x) {x$New <- row.names(x) ; x}) > Res <- Reduce(function(...) merge(..., by = "New", all=TRUE), l) Error in fix.by(by.x, x) : 'by' must specify a uniquely valid column – JBWhitmore Nov 06 '15 at 07:49
Can you show an example of these data sets? I've tested this on several data sets and it works. Your example should be fully reproducible. – David Arenburg Nov 06 '15 at 07:52
This works perfectly fine on your data sets. See my edit above. – David Arenburg Nov 06 '15 at 07:56
I'm trying to find out where these numbers could be coming from. Cat5 gives 2.41? so it's averaging what numbers to get that? I'm hoping to get a final summary that looks like (edit in question). – JBWhitmore Nov 06 '15 at 08:08
Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/94403/discussion-between-jbwhitmore-and-david-arenburg). – JBWhitmore Nov 06 '15 at 08:23
See my edit. I'm sorry, I don't have any more time to invest into this. – David Arenburg Nov 06 '15 at 08:35

Aggregate across a list of dataframes

1 Answers1