Averaging an increasing number of columns of a dataframe

Question

I have a data frame (wc2) with 7 columns:

        cm5      cm10      cm15      cm20      cm25      cm30 run_time
1 0.1221060 0.1221060 0.1221060 0.1221060 0.1221060 0.1221060        0
2 0.4084525 0.4028010 0.3617393 0.2595060 0.1294412 0.1220099        2
3 0.4087809 0.4042515 0.3711077 0.3119956 0.2241836 0.1290348        4
4 0.4088547 0.4045780 0.3732053 0.3218224 0.2611785 0.1720426        6
5 0.4088770 0.4046887 0.3739936 0.3255557 0.2739738 0.2081264        8
6 0.4088953 0.4047649 0.3744183 0.3273794 0.2798225 0.2273250       10

For every row (run_time) I want to average first the 1st column, then the 1st and 2nd columns, then the 1st, 2nd and 3rd columns and so on until the 6th column. The averaged result should be in a new column or a new data frame (I don't mind). I did it using the following code:

wc2$dia10 <- wc2$cm5
wc2$dia20 <- rowMeans(wc2[c("cm5", "cm10")])
wc2$dia30 <- rowMeans(wc2[c("cm5", "cm10", "cm15")])
wc2$dia40 <- rowMeans(wc2[c("cm5", "cm10", "cm15", "cm20")])
wc2$dia50 <- rowMeans(wc2[c("cm5", "cm10", "cm15", "cm20", "cm25")])
wc2$dia60 <- rowMeans(wc2[c("cm5", "cm10", "cm15", "cm20", "cm25", "cm30")])

From my basic knowledge of R, I gather there is a much better way of doing this, but I can't figure it out, especially for when I have a larger number of columns. I had a look at the answer to "Sum over and increasing number of columns of a data frame in R" but couldn't understand it or apply it to my data.

Thanks for any help

Sotos · Accepted Answer · 2017-08-11T11:57:09.273

6

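You can use Reduce with the accumulate = TRUE argument as follows (the [-1] drops the first index set, so this returns the averages over two to six columns only):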

sapply(Reduce(c, 1:(ncol(df)-1), accumulate = TRUE)[-1], function(i) rowMeans(df[i]))
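
To see why this works, it may help to look at what the Reduce call produces on its own. The following is only a minimal sketch: the data frame `df` and its values are made up for illustration (six measurement columns plus run_time, as in the question).

```r
# Hypothetical data: six measurement columns plus run_time (values are illustrative)
df <- data.frame(cm5 = c(1, 2), cm10 = c(3, 4), cm15 = c(5, 6),
                 cm20 = c(7, 8), cm25 = c(9, 10), cm30 = c(11, 12),
                 run_time = c(0, 2))

# Reduce(c, ..., accumulate = TRUE) builds the growing index sets 1, 1:2, 1:3, ...
idx <- Reduce(c, 1:(ncol(df) - 1), accumulate = TRUE)
str(idx)

# [-1] drops the first set (column 1 alone), so sapply returns five columns
res <- sapply(idx[-1], function(i) rowMeans(df[i]))
dim(res)
```

Each element of `idx` is a vector of column positions, and `rowMeans(df[i])` averages exactly those columns for every row.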

Or to get the exact output,

setNames(data.frame(df[1],sapply(Reduce(c, 1:(ncol(df)-1),accumulate = TRUE)[-1], function(i) 
           rowMeans(df[i]))), paste0('dia', seq(from = 10, to = ncol(df[-1])*10, by = 10)))

Or as @A5C1D2H2I1M1N2O1R2T1 suggests in comments,

do.call(cbind, setNames(lapply(1:6, function(x) rowMeans(df[1:x])),
                        paste0("dia", seq(10, 60, 10))))

The last two both give,

      dia10     dia20     dia30     dia40     dia50     dia60
1 0.1221060 0.1221060 0.1221060 0.1221060 0.1221060 0.1221060
2 0.4084525 0.4056268 0.3909976 0.3581247 0.3123880 0.2806583
3 0.4087809 0.4065162 0.3947134 0.3740339 0.3440639 0.3082257
4 0.4088547 0.4067164 0.3955460 0.3771151 0.3539278 0.3236136
5 0.4088770 0.4067829 0.3958531 0.3782787 0.3574178 0.3325359
6 0.4088953 0.4068301 0.3960262 0.3788645 0.3590561 0.3371009

Or, to add it to the original data frame:

cbind(df, setNames(lapply(1:6, function(x) rowMeans(df[1:x])),
                                    paste0("dia", seq(10, 60, 10))))
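
Since the question mentions a larger number of columns later on, the hard-coded 1:6 and seq(10, 60, 10) could be derived from the data instead. Below is a rough sketch of that idea. The helper name `cum_row_means` and its arguments `k`, `prefix` and `step` are hypothetical, not part of any package, and the sample data are the first two rows from the question.

```r
# Hypothetical helper: row means over the first 1, 2, ..., k columns of d
cum_row_means <- function(d, k = ncol(d), prefix = "dia", step = 10) {
  out <- lapply(seq_len(k), function(x) rowMeans(d[seq_len(x)]))
  names(out) <- paste0(prefix, seq(step, by = step, length.out = k))
  as.data.frame(out)
}

# First two rows of the data from the question
wc2 <- data.frame(cm5  = c(0.1221060, 0.4084525), cm10 = c(0.1221060, 0.4028010),
                  cm15 = c(0.1221060, 0.3617393), cm20 = c(0.1221060, 0.2595060),
                  cm25 = c(0.1221060, 0.1294412), cm30 = c(0.1221060, 0.1220099),
                  run_time = c(0, 2))

# Average over every column except the last one (run_time) and append the result
res <- cbind(wc2, cum_row_means(wc2, k = ncol(wc2) - 1))
names(res)
```

This assumes the columns to average come first and are already in the desired order, as they are in the question's data.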

edited Aug 11 '17 at 11:57

answered Aug 11 '17 at 11:33

Or just `setNames(lapply(1:6, function(x) rowMeans(mydf[1:x])),paste0("dia", seq(10, 60, 10)))` and `cbind` with the original data. – A5C1D2H2I1M1N2O1R2T1 Aug 11 '17 at 11:37
@A5C1D2H2I1M1N2O1R2T1 I see 1...1, 2...1, 2, 3...my brain says accumulate, accumulate! :p – Sotos Aug 11 '17 at 11:45
This works well! I don't understand how I didn't figure it out. I thought about the lapply function but couldn't find the way to use it. – Jack Aug 11 '17 at 12:08
Can someone please explain why for the standalone data frame the "do.call" is needed and for adding to the original data frame only the "cbind" is needed? – Jack Aug 11 '17 at 12:22
Well, `do.call` (from the documentation) is used in the first case to construct and execute the function call (see [here for more info](https://stackoverflow.com/a/10801883/5635580)). In the second case, we are calling `cbind` on two lists (data frames are lists) directly. – Sotos Aug 11 '17 at 12:30
@Jack, you actually don't need `do.call`. Each `list` item represents a column in a `data.frame` format, so you can actually just do `data.frame(setNames(lapply(...), ...))`. – A5C1D2H2I1M1N2O1R2T1 Aug 12 '17 at 05:55
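
The difference discussed in the comments above can be tried on a tiny example. This is just an illustrative sketch with a made-up list `cols` standing in for the result of `setNames(lapply(...), ...)`.

```r
# A named list of equal-length vectors, like the output of setNames(lapply(...), ...)
cols <- list(dia10 = c(1, 2), dia20 = c(3, 4))

# do.call(cbind, cols) behaves like cbind(dia10 = c(1, 2), dia20 = c(3, 4)),
# so the list elements become separate arguments and the result is a matrix
m <- do.call(cbind, cols)

# data.frame() accepts the list directly and treats each element as a column
d <- data.frame(cols)

is.matrix(m)
is.data.frame(d)
```

Calling `cbind(cols)` without `do.call` would pass the whole list as a single argument, which is not the same thing. When one of the arguments is a data frame, as in `cbind(df, ...)`, the data frame method of `cbind` is used and the list elements are added as columns.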

lmo · Answer 2 · 2017-08-11T13:09:25.370

2

Here is an alternative method with apply and cumsum. Using rowMeans is almost surely preferable, but this method runs through the calculation in one pass.

setNames(data.frame(t(apply(dat[1:6], 1, cumsum) / 1:6)),
         paste0("dia", seq(10, 60, 10)))
      dia10     dia20     dia30     dia40     dia50     dia60
1 0.1221060 0.1221060 0.1221060 0.1221060 0.1221060 0.1221060
2 0.4084525 0.4056268 0.3909976 0.3581247 0.3123880 0.2806583
3 0.4087809 0.4065162 0.3947134 0.3740339 0.3440639 0.3082257
4 0.4088547 0.4067164 0.3955460 0.3771151 0.3539278 0.3236136
5 0.4088770 0.4067829 0.3958531 0.3782787 0.3574178 0.3325359
6 0.4088953 0.4068301 0.3960262 0.3788645 0.3590561 0.3371009
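
The division by 1:6 relies on the shape that apply returns, which may not be obvious. Here is a small sketch with made-up values (the data frame `dat` below is hypothetical and has three columns rather than six) that walks through the intermediate steps.

```r
# Hypothetical data with three value columns
dat <- data.frame(a = c(1, 10), b = c(3, 20), c = c(5, 30))
k <- ncol(dat)

# apply(, 1, cumsum) returns a k x nrow(dat) matrix: one COLUMN per original row
cs <- apply(dat, 1, cumsum)
dim(cs)

# seq_len(k) is recycled down each column, so entry [i, j] is divided by i,
# the number of values summed so far; t() restores the original orientation
res <- t(cs / seq_len(k))
res
```

Because the matrix has exactly k rows, the divisor lines up with the cumulative position regardless of how many rows the data frame has.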

Using the smarter Reduce("+", ...) with accumulate = TRUE approach suggested by @alexis_laz, we could do

mapply("/", Reduce("+", dat[1:6], accumulate = TRUE), 1:6)

or to get a data.frame with the desired names

setNames(data.frame(mapply("/", Reduce("+", dat[1:6], accumulate = TRUE), 1:6)),
         paste0("dia", seq(10, 60, 10)))

The uglier code below follows the same idea, without mapply

setNames(data.frame(Reduce("+", dat[1:6], accumulate = TRUE)) /
                    rep(1:6, each=nrow(dat)), paste0("dia", seq(10, 60, 10)))
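
For a different number of columns, the same idea can be written without hard-coding 1:6. The sketch below uses a made-up four-column data frame `dat` and cross-checks the running-sum result against the rowMeans approach. It is meant as an illustration, not a benchmark.

```r
# Hypothetical data with four value columns
dat <- data.frame(a = c(1, 2), b = c(3, 6), c = c(5, 10), d = c(7, 14))

# Running column sums: a, a + b, a + b + c, ...
sums <- Reduce(`+`, dat, accumulate = TRUE)

# Divide the j-th running sum by j; seq_along() avoids hard-coding the column count
cum_means <- mapply(`/`, sums, seq_along(sums))

# Cross-check against the rowMeans approach
ref <- sapply(seq_along(dat), function(x) rowMeans(dat[seq_len(x)]))
all.equal(unname(cum_means), unname(ref))
```

One difference to keep in mind: rowMeans has an na.rm argument, while a running sum propagates any NA to all later columns.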

edited Aug 11 '17 at 13:09

answered Aug 11 '17 at 12:07

An alternative to `apply(, 1, cumsum)` is to use `Reduce`: `mapply("/", Reduce("+", dat[1:6], accumulate = TRUE), 1:6)`. I believe, generally, using this cumulative sum approach should be more efficient as it avoids re-adding same columns – alexis_laz Aug 11 '17 at 12:20
It works as well though I prefer using shorter code lines :S Makes it easier for me to understand. – Jack Aug 11 '17 at 12:26
Thanks, @alexis_laz. That is awesome. I forgot about the `Reduce("+", ...` with accumulate method (even though its close relative is in Sotos's answer). – lmo Aug 11 '17 at 12:28
@alexis_laz That is a nice suggestion. You should add that somewhere (either answer it, or add it to mine or lmo's answer)...Idk...whatever...just get it out there :) – Sotos Aug 11 '17 at 12:33
@alexis_laz I added your answer above, with a modification to return a data.frame, but if you want to post, I'll delete it and certainly up vote your post. – lmo Aug 11 '17 at 12:38
Thanks, @Jack. Got it fixed. – lmo Aug 11 '17 at 13:09
