How do I create a column of means of specific columns in a data.frame?

Question

Thanks all for your responses and answers. I can see I've unintentionally left out some important details that may help you understand my problem better. I was trying to keep it simple and generic, but that didn't actually help. Here's an updated version with more information.

I have a data.frame with many columns that came from a NetLogo model generated by BehaviorSpace. Each column is a time series that represents a reported value under different experimental conditions with repetitions represented by the run number and time step number. For example (sorry this is long but I'm trying to give you a flavor for the data):

# Start by building a fake data.frame that models some of the characteristics of mine:
df <- data.frame(run = c(rep(1,5), rep(2,5), rep(3,5), rep(4,5), rep(5,5), rep(6,5), rep(7,5), rep(8,5)))
df2 <- expand.grid(step = 1:5, fac.a = c(10,1000), fac.b = c(0.5,2.0))
df <- data.frame(run = df$run, rep = c(rep(1,20), rep(2,20)), step = df2$step, fac.a = df2$fac.a, fac.b = df2$fac.b)
log_growth <- function (a, b, x) {(1/(1+a*exp(-b*x))) + rnorm(1,0,0.2)}
set.seed(11)
df$treatment1 <- log_growth(df$fac.a, df$fac.b, df$step)
df$treatment2 <- log_growth(df$fac.a / 2, df$fac.b * 2, df$step)

This puts the following into df:

> df
   run rep step fac.a fac.b  treatment1  treatment2
1    1   1    1    10   0.5  0.05288201 0.356176584
2    1   1    2    10   0.5  0.12507561 0.600407158
3    1   1    3    10   0.5  0.22081815 0.804671117
4    1   1    4    10   0.5  0.33627099 0.920093934
5    1   1    5    10   0.5  0.46053940 0.971397427
6    2   1    1  1000   0.5 -0.08700866 0.009396323
7    2   1    2  1000   0.5 -0.08594375 0.018552055
8    2   1    3  1000   0.5 -0.08419297 0.042608835
9    2   1    4  1000   0.5 -0.08131981 0.102435481
10   2   1    5  1000   0.5 -0.07661880 0.232875872
11   3   1    1    10   2.0  0.33627099 0.920093934
12   3   1    2    10   2.0  0.75654214 1.002314651
13   3   1    3    10   2.0  0.88715737 1.003958435
14   3   1    4    10   2.0  0.90800192 1.003988593
15   3   1    5    10   2.0  0.91089154 1.003989145
16   4   1    1  1000   2.0 -0.08131981 0.102435481
17   4   1    2  1000   2.0 -0.03688314 0.860350536
18   4   1    3  1000   2.0  0.19880473 1.000926458
19   4   1    4  1000   2.0  0.66014952 1.003932891
20   4   1    5  1000   2.0  0.86791705 1.003988125
21   5   2    1    10   0.5  0.05288201 0.356176584
22   5   2    2    10   0.5  0.12507561 0.600407158
23   5   2    3    10   0.5  0.22081815 0.804671117
24   5   2    4    10   0.5  0.33627099 0.920093934
25   5   2    5    10   0.5  0.46053940 0.971397427
26   6   2    1  1000   0.5 -0.08700866 0.009396323
27   6   2    2  1000   0.5 -0.08594375 0.018552055
28   6   2    3  1000   0.5 -0.08419297 0.042608835
29   6   2    4  1000   0.5 -0.08131981 0.102435481
30   6   2    5  1000   0.5 -0.07661880 0.232875872
31   7   2    1    10   2.0  0.33627099 0.920093934
32   7   2    2    10   2.0  0.75654214 1.002314651
33   7   2    3    10   2.0  0.88715737 1.003958435
34   7   2    4    10   2.0  0.90800192 1.003988593
35   7   2    5    10   2.0  0.91089154 1.003989145
36   8   2    1  1000   2.0 -0.08131981 0.102435481
37   8   2    2  1000   2.0 -0.03688314 0.860350536
38   8   2    3  1000   2.0  0.19880473 1.000926458
39   8   2    4  1000   2.0  0.66014952 1.003932891
40   8   2    5  1000   2.0  0.86791705 1.003988125

So what I did before is split up the data frame using by and wanted to obtain averages and standard deviations for every step (it's a time series) and each combination of factors.

After having looked at all your answers and having reconsidered my problem, I think what I'm trying to do would be better handled during the conversion process of by. I'm not exactly sure how to do that... What I want the output to look like is a summary of sorts:

> df
   run fac.a fac.b  mean.treatment1  mean.treatment2 sd.treatment1 sd.treatment2
1    1    10   0.5        xxxxxxxxx       xxxxxxxxxx    xxxxxxxxxx   xxxxxxxxxxx
1    1    10   2.0        xxxxxxxxx       xxxxxxxxxx    xxxxxxxxxx   xxxxxxxxxxx
1    1  1000   0.5        xxxxxxxxx       xxxxxxxxxx    xxxxxxxxxx   xxxxxxxxxxx
1    1  1000   2.0        xxxxxxxxx       xxxxxxxxxx    xxxxxxxxxx   xxxxxxxxxxx

Is this a job for aggregate? Thanks for your patience and help. -- Glenn

Original question:

I have a data.frame with many columns, each of which represents a specific experimental condition with repetitions.

> df <- data.frame(a.1 = runif(5), b.1 = runif(5), a.2 = runif(5), b.2 = runif(5), mean.a = 0, mean.b = 0, mean.1 = 0, mean.2 = 0)
> df
        a.1       b.1       a.2       b.2 mean.a mean.b   sd.a   sd.b
1 0.9209433 0.3501444 0.3893140 0.3264827      0      0      0      0
2 0.4171254 0.4883140 0.8282384 0.1215129      0      0      0      0
3 0.2291582 0.9419946 0.4089008 0.5665242      0      0      0      0
4 0.3807868 0.1889066 0.8271075 0.4022014      0      0      0      0
5 0.5863078 0.4991847 0.4082745 0.5637367      0      0      0      0

I want to find means and standard deviations for each condition and repetition. So far the most direct way seems to be:

for (i in c("a.1", "a.2") {df$mean.a <- df$mean.a + df[[i]]}
df$mean.a <- df$mean.a / 2

But I have a lot of columns, and they are all over the place, so this seems really labor intensive and manual. A little nicer method is to use ave():

df$mean.a <- with (df, ave(a.1, a.2))

But if I want to do sd() instead, I mysteriously get NAs:

df$sd.a <- with (df, ave(a.1, a.2, FUN = sd))
> df
        a.1       b.1       a.2       b.2    mean.a mean.b   sd.a   sd.b
1 0.9209433 0.3501444 0.3893140 0.3264827 0.9209433      0     NA      0
2 0.4171254 0.4883140 0.8282384 0.1215129 0.4171254      0     NA      0
3 0.2291582 0.9419946 0.4089008 0.5665242 0.2291582      0     NA      0
4 0.3807868 0.1889066 0.8271075 0.4022014 0.3807868      0     NA      0
5 0.5863078 0.4991847 0.4082745 0.5637367 0.5863078      0     NA      0

I would prefer not to use external packages if possible, but it seems like I'm missing something basic. This question was similar, but had to do with data.tables, not data.frames.

Another was even closer, but using ave() is also tedious to specify, for instance, columns 1-12, 15-17, and 26 as the subject columns, and mysteriously, sd() produces those NA's. Seems like there should be a straightforward way to do this. Almost makes me wish for Excel. :-)

Your data layout is painful. You should use `reshape2' to bring the data into a more easy to handle format. — Peter Lustig, Sep 17 '14 at 21:48
Do you understand what `ave` is doing? You doing `mean` and `sd` over one value each time. Didn't you notice that `mean.a` column is exactly the same as `a.1`? The reason you are getting "mysterious" `NA` from `sd` function, is because you are trying to calculate SD of one value. Try, for example, `sd(1)`. I would suggest you make your example more reproducible and also add some desired output please — David Arenburg, Sep 17 '14 at 21:50
If you use `set.seed()` in your example, then we can [reproduce](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) the same values. Can you also provide the desired output for your sample data? When you say you have many more combinations, what exactly does that mean? That the prefix before the dot and suffix after have more combinations. And you're trying to calculate per-row means and standard deviations? Is there a reason you are forcing such an awkward/stacked data structure? — MrFlick, Sep 17 '14 at 21:50

score 2 · Accepted Answer · answered Sep 17 '14 at 22:04

Let us first bring your data into an acceptable format. Note that this solution does, against your initial requirements, indeed rely on external libraries, but they are very common and true timesavers today! (plyr and reshape2 by Hadley Wickham, who is a phenomenon in the R community)

# Note how I only used the data columns, initially, there is no mean and sd column in the data frame used at this stage.
df <- data.frame(a.1 = runif(5), b.1 = runif(5), a.2 = runif(5), b.2 = runif(5))

df$repetition = c(1:nrow(df))
library(reshape2)
tmp = melt(df, id.vars = "repetition")
names(tmp)[2] = "condition"

tmp$treatment = substring(tmp$condition,1,1)

This yields:

> head(tmp)
  repetition condition     value treatment
1          1       a.1 0.6668952         a
2          2       a.1 0.1248151         a
3          3       a.1 0.7082199         a
4          4       a.1 0.9840956         a
5          5       a.1 0.4479190         a
6          1       b.1 0.9381539         b

Now, the rest is easy, we rely on the popular plyr package:

library(plyr)
results = ddply(tmp, .(repetition, treatment), summarize, mean = mean(value), sd = sd(value) )

The final result is

> head(results)
  repetition treatment      mean         sd
1          1         a 0.6777342 0.01532853
2          1         b 0.6734955 0.37428353
3          2         a 0.4533126 0.46456561
4          2         b 0.8441925 0.07260509
5          3         a 0.3967338 0.44050779
6          3         b 0.5886821 0.42635902

Let's hope this is what you were looking for.

One more interesting addition, if you do not want to differentiate each repetition, but rather on a treatment level

# addition
results = ddply(tmp, .( treatment), summarize, mean = mean(value), sd = sd(value) )

and the result:

> head(results)
  treatment      mean        sd
1         a 0.5817867 0.2954151
2         b 0.6212537 0.3219035

Thanks, Peter. I was trying to avoid the "complication" of learning more packages, but you have showed how `plyr` and `reshape2` actually make things a lot simpler, and they are very commonly used. I just haven't gotten around to using them. Thanks for the great example. — theoden, Sep 19 '14 at 02:21

score 1 · Answer 2 · answered Sep 17 '14 at 23:24

Ignoring the "base-only" requirement to whip the data into shape, using tidyr and the pipe operator from magrittr:

set.seed(42)
df  <- data.frame(a.1 = runif(5), b.1 = runif(5), a.2 = runif(5), b.2 = runif(5))
df2 <- df %>%
  gather(treatment, value) %>%
  separate(treatment, c("treatment", "repetition"))
head(df2)
#    treatment repetition      value
# 1          a          1 0.13871017
# 2          a          1 0.98889173
# 3          a          1 0.94666823
# 4          a          1 0.08243756
# 5          a          1 0.51421178
# 6          b          1 0.39020347

Now, I'm not sure what exactly you're trying to get the average and standard deviation of, but one easy option is aggregate() from base R. Simple pass the function you'd like through the FUN parameter:

# calculate mean on treatment (a or b)
aggregate(df2$value, by = list(treatment = df2$treatment), FUN = mean)
#   treatment repetition         x
# 1         a          1 0.5341839
# 2         b          1 0.6633022
# 3         a          2 0.5442395
# 4         b          2 0.4225865

# calculate mean on treatment and repetition
aggregate(df2$value, by = list(treatment = df2$treatment, repetition = df2$repetition, FUN = mean)
#   treatment         x
# 1         a 0.5392117
# 2         b 0.5429444

score 1 · Answer 3 · answered Sep 18 '14 at 04:10

Based on the code you showed, may be this base R method would help:

 set.seed(42)
 df <- data.frame(a.1 = runif(5), b.1 = runif(5), a.2 = runif(5), b.2 = runif(5))
   do.call(cbind,
     lapply(split(seq_along(df),gsub("\\..*", "",colnames(df))), function(x) {
        x1 <- df[,x]
        data.frame(Means=rowMeans(x1, na.rm=TRUE), SD=apply(x1, 1, sd, na.rm=TRUE))}))
  #  a.Means      a.SD   b.Means       b.SD
  #1 0.6862739 0.3231932 0.7295552 0.29763438
  #2 0.8280938 0.1541232 0.8574074 0.17086395
  #3 0.6104059 0.4585819 0.1260770 0.01214755
  #4 0.5429382 0.4065997 0.5659947 0.12869005
  #5 0.5520192 0.1268922 0.6326988 0.10234101

Using your code, I get the same result

  vec1 <- vector("numeric", length=5)
  for(i in c("a.1", "a.2")) {vec1 <- vec1+df[[i]]}
  vec1/2
  #[1] 0.6862739 0.8280938 0.6104059 0.5429382 0.5520192

How do I create a column of means of specific columns in a data.frame?

3 Answers3

Linked