How to calculate the mean for a column subsetted from the data

Question

This shouldn't be too hard, but I always have issues when trying to run calculations on a column in a data frame that depend on the value of another column in the same data frame. Here is my data.frame:

          stream      reach length.km length.m total.sa pools.sa
1           Stream Reach_Code       109      109        1        1
2           Brooks    BRK_001        17       14      108       13
3           Brooks    BRK_002        15       12       99        9
4           Brooks    BRK_003        24       21       94       95
5           Brooks    BRK_004        32       29       97       33
6           Brooks    BRK_005        27       24       92       79
7           Brooks    BRK_006        26       23       95        6
8           Brooks    BRK_007        16       13       77       15
9           Brooks    BRK_008        29       26       84       26
10          Brooks    BRK_009        18       15       87       46
11          Brooks    BRK_010        23       20       88       47
12          Brooks    BRK_011        22       19       91       40
13          Brooks    BRK_012        30       27       98       37
14          Brooks    BRK_013        25       22       93       29
19 Buncombe_Hollow   BNH_0001         7        4       75       65
20 Buncombe_Hollow   BNH_0002         8        5       66       21
21 Buncombe_Hollow   BNH_0003         9        6       68       53
22 Buncombe_Hollow   BNH_0004        19       16       81       11
23 Buncombe_Hollow   BNH_0005         6        3       65       27
24 Buncombe_Hollow   BNH_0006        13       10       63       23
25 Buncombe_Hollow   BNH_0007        12        9       71       57

I would like to calculate the mean of a column (let's say length.m) where stream = Brooks, and then do the same thing for stream = Buncombe_Hollow. I actually have 17 different stream names and plan on calculating the mean of some column for each stream. I will then store these means as a vector and bind them to another vector of the stream names, so the end result is something like this:

    stream   truevalue
1   Brooks   0.9440620
2   Siouxon  0.5858527
3   Speelyai 0.5839844

Thanks!

What is the data on the first row in your example dataset? Is it significant for the calculations, or should it be dropped? Do you have similar rows later in your set? — erasmortg, Aug 07 '15 at 21:28
Thank you for catching that, I have removed it. It was an artifact from the CSV. I see another issue: most of my values look like 0.34355, but when I import the CSV into R, it rounds them to whole numbers, for example 0.34355 becomes 34 — Christopher, Aug 07 '15 at 21:36
That should not be a problem, but if you want more digits printed you could try `options(digits=10)` or so, see here http://stackoverflow.com/questions/4540649/retain-numerical-precision-in-an-r-data-frame — erasmortg, Aug 07 '15 at 21:38
This is a major problem. It is taking a decimal and converting it into a whole number! — Christopher, Aug 07 '15 at 21:42
`with(df, tapply(length.m, stream, mean))` would be a bit faster than `aggregate` — Rich Scriven, Aug 07 '15 at 21:43
I fixed it. It seems as though this bit of code was really changing things: `cols = c(3:27)` followed by `data[,cols] = apply(data[,cols],2,function(x) as.numeric(as.factor(x)))` — Christopher, Aug 07 '15 at 21:51
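
A note on the last comment: the likely reason that line turned decimals into whole numbers is that `as.numeric()` applied to a factor returns the integer level codes, not the original values. Here is a minimal sketch of the effect; the vector `x` is made up for illustration and is not taken from the asker's CSV.

```r
# Made-up decimals standing in for one column of the imported data
x <- c(0.34355, 0.9440620, 0.5858527)

# as.factor() stores each distinct value as a level (sorted ascending);
# as.numeric() on the factor returns the position of each level, so the
# decimals are replaced by their ranks among the distinct values
codes <- as.numeric(as.factor(x))

# Going through as.character() recovers the values the levels spell out
values <- as.numeric(as.character(as.factor(x)))

print(codes)
print(values)
```

If the columns are already numeric after `read.csv`, no conversion is needed at all; if they were read in as factors, `as.numeric(as.character(x))` is the usual way to get the numbers back.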

RichAtMango · Answer 1 · 2015-08-07T21:32:13.360

4

Try using `aggregate`:

# Generate some data to use
someDf <- data.frame(stream = rep(c("Brooks", "Buncombe_Hollow"), each = 10),
  length.m = rpois(20, 4))

# Calculate the means with aggregate
with(someDf, aggregate(list(truevalue = length.m), list(stream = stream), mean))

The reason for the "list" bits is to explicitly name the columns in the (data frame) output.
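
For comparison, here is a sketch of the same calculation using the formula interface of `aggregate`, plus the `tapply` variant mentioned in the comments. The data frame `df` and its values are made up for illustration; only the column names `stream` and `length.m` come from the question.

```r
# Made-up data shaped like the question's; only two columns are needed
df <- data.frame(
  stream   = rep(c("Brooks", "Buncombe_Hollow"), times = c(3, 2)),
  length.m = c(14, 12, 21, 4, 5)
)

# Formula interface: mean of length.m for each value of stream
means <- aggregate(length.m ~ stream, data = df, FUN = mean)

# Rename the value column to match the output the question asks for
names(means)[names(means) == "length.m"] <- "truevalue"
print(means)

# tapply returns a named numeric vector rather than a data frame
by_stream <- with(df, tapply(length.m, stream, mean))
print(by_stream)
```

The formula version needs a rename step afterwards, while the `list(...)` version above names the columns in a single call; which one reads better is mostly a matter of taste.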

edited Aug 07 '15 at 21:32

answered Aug 07 '15 at 21:27

RichAtMango


score 2 · Accepted Answer · answered Aug 07 '15 at 21:28

2

Start using the dplyr package. It makes such calculations quick as well as very easy to write:

library(dplyr)
result <- data %>% group_by(stream) %>% summarize(truevalue = mean(length.m))

answered Aug 07 '15 at 21:28

Rohit Das

