R: For a timespan of some years compute the mean of measured values in every calendary month

Question

I got a data.frame like the one at the end (df) that I found here. I want to calculate the mean of e.g. df$PH for every month. I messed around quite a while and ended up with the following (with help from this question):

library(lubridate)
library(zoo)

# Get first month where measuremets have taken place
first <- floor_date(min(df$TIME), unit = "months")

# Calculate number of (calendary) months between first and last measurement
num.months <- ceiling(abs(as.yearmon(strptime(min(df$TIME), format = "%Y-%m-%d"))
                          - as.yearmon(strptime(max(df$TIME), format = "%Y-%m-%d"))) * 12)

# Calculate means of pH in every (calendary) month
my <- c()
for(i in 1:num.months){
  mon <- interval(start = first %m+% months(i - 1), end = first %m+% months(i))
  my <- c(my, mean(df[which(df$TIME %within% mon), "PH"], na.rm = T))
}

I think there is most probably an easier solution. Something like:

for(months in interval("2000-07-26", "2014-03-06")){
  mean(df[which(df$TIME %within% months), "PH"])
}

Do you know any?
(Otherwise people can at least find this one.)

df <- structure(list(TIME = structure(c(11164, 11296, 11296, 11296, 
11330, 11330, 11547, 11547, 11547, 11547, 11639, 11639, 11639, 
11639, 11758, 11758, 11758, 11758, 11869, 11869, 11869, 11869, 
11968, 11968, 11968, 11968, 12009, 12009, 12009, 12102, 12102, 
12102, 12102, 12102, 12193, 12193, 12193, 12193, 12193, 12193, 
12193, 12282, 12282, 12282, 12282, 12282, 12282, 12368, 12368, 
12368, 12368, 12500, 12500, 12500, 12500, 12500, 12500, 12500, 
12590, 12590, 12590, 12590, 12590, 12590, 12590, 12684, 12684, 
12684, 12684, 12684, 12800, 12800, 12800, 12800, 12800, 12891, 
12891, 12891, 12891, 12891, 12891, 12891, 12998, 12998, 12998, 
12998, 13101, 13101, 13101, 13101, 13101, 13273, 13273, 13273, 
13273, 13362, 13362, 13362, 13362, 13459, 13459, 13459, 13459, 
13459, 13459, 13557, 13557, 13557, 13557, 13557, 13557, 13557, 
13654, 13654, 13654, 13654, 13654, 13654, 13767, 13767, 13767, 
13767, 13767, 13767, 13767, 13857, 13857, 13857, 13857, 13857, 
13857, 13857, 13949, 13949, 13949, 13949, 13949, 13949, 13949, 
14048, 14048, 14048, 14048, 14048, 14048, 14048, 14398, 14398, 
14398, 14398, 14398, 14755, 14755, 14755, 14755, 14755, 14755, 
14755, 14943, 14943, 14943, 14943, 14943, 14943, 15806, 15806, 
15806, 15806, 15806, 15806, 15806, 16135, 16135, 16135, 16135, 
16135, 16135), class = "Date"), GAUGE_ID = c(198L, 200L, 199L, 
198L, 202L, 198L, 204L, 198L, 203L, 201L, 203L, 204L, 201L, 198L, 
201L, 198L, 203L, 204L, 203L, 201L, 204L, 198L, 203L, 204L, 201L, 
198L, 199L, 202L, 200L, 203L, 201L, 204L, 198L, 199L, 203L, 201L, 
202L, 204L, 198L, 200L, 199L, 201L, 200L, 199L, 198L, 203L, 204L, 
198L, 201L, 203L, 204L, 203L, 204L, 202L, 201L, 200L, 199L, 198L, 
203L, 204L, 202L, 198L, 199L, 200L, 201L, 198L, 199L, 201L, 203L, 
204L, 199L, 198L, 201L, 203L, 204L, 203L, 200L, 199L, 198L, 204L, 
202L, 201L, 198L, 203L, 201L, 204L, 199L, 198L, 201L, 203L, 204L, 
201L, 198L, 203L, 204L, 201L, 198L, 203L, 204L, 198L, 199L, 201L, 
203L, 204L, 200L, 198L, 199L, 201L, 202L, 203L, 204L, 200L, 204L, 
200L, 203L, 201L, 199L, 198L, 204L, 202L, 201L, 199L, 198L, 203L, 
200L, 201L, 202L, 203L, 204L, 200L, 199L, 198L, 200L, 204L, 203L, 
202L, 201L, 199L, 198L, 204L, 203L, 202L, 201L, 199L, 200L, 198L, 
199L, 201L, 204L, 203L, 202L, 200L, 203L, 202L, 201L, 199L, 204L, 
198L, 204L, 203L, 202L, 201L, 200L, 198L, 198L, 199L, 201L, 200L, 
203L, 204L, 202L, 198L, 199L, 200L, 202L, 204L, 201L), PH = c(7.22, 
7.2, 6.9, 7.11, 7.01, 7.03, 7.03, 7.17, 6.99, 6.99, 6.85, 6.84, 
6.88, 6.95, 6.94, 7.04, 6.93, 6.93, 7.21, 7.23, 7.19, 7.25, 7.07, 
7.09, 7.09, 7.16, 7.43, 7.47, 7.41, 7.08, 7.05, 7.07, 7.1, 7.15, 
7.12, 7.16, 7.25, 7.2, 7.26, 7.49, 7.62, 7.12, 7.16, 7.44, 7.3, 
7.11, 7.2, 7.21, 7.14, 7.06, 6.66, 7.28, 7.29, 7.42, 7.22, 7.63, 
7.52, 7.35, 7.05, 7.23, 7.16, 7.32, 7.3, 7.22, 7.19, 7.34, 7.55, 
7.27, 7.26, 7.36, 7.42, 7.35, 7.2, 7.3, 7.24, 7.19, 7.45, 7.5, 
7.36, 7.21, 7.34, 7.25, 7.11, 7.01, 7.03, 6.95, 7.36, 7.3, 7.2, 
7.21, 7.2, 7.21, 7.23, 7.16, 7.12, 7.19, 7.29, 7.19, 7.14, 5.31, 
6.99, 7.15, 7.16, 6.87, 7.12, 7.3, 7.67, 7.28, 7.54, 7.29, 7.45, 
7.44, 7.17, 7.13, 7.13, 7.14, 7.68, 7.6, 6.89, 7.42, 7.3, 7.4, 
7.28, 7.29, 7.23, 7.21, 7.44, 7.21, 7.4, 7.26, 7.48, 7.29, 7.33, 
7.08, 7.27, 7.57, 7.28, 7.69, 7.44, 7.31, 7.36, 7.4, 7.36, 7.47, 
7.75, 7.5, 6.97, 7.08, 6.96, 7.1, 7.07, 7.13, 7.38, 7.11, 7.22, 
7.19, 7.13, 7.29, 7.12, 7.03, 7.09, 7.23, 7, 7.11, 7.85, 7.8, 
7.36, 8.01, 7.03, 7.33, 7.32, 7.31, 7.56, 7.27, 7.32, 7.08, 7.07
), EH = c(100L, 470L, 462L, 1L, 427L, -6L, 118L, -24L, 57L, -12L, 
93L, 136L, 119L, 101L, 128L, 32L, 35L, 137L, -4L, 107L, 84L, 
79L, -1L, 71L, 108L, 5L, 467L, 450L, 470L, 11L, 111L, 80L, -2L, 
160L, -16L, 148L, 215L, 103L, -45L, 260L, 300L, 141L, 288L, 335L, 
64L, 85L, 120L, 86L, 129L, 72L, 144L, 49L, 108L, 335L, 170L, 
285L, 308L, 5L, 49L, 108L, 335L, 5L, 308L, 285L, 170L, 78L, 282L, 
145L, 70L, 146L, 325L, 92L, 137L, 94L, 137L, 51L, 283L, 290L, 
77L, 116L, 287L, 143L, 87L, 81L, 129L, 117L, 243L, 29L, 114L, 
-5L, 105L, 131L, 121L, 60L, 88L, 127L, 69L, 62L, 122L, 64L, 450L, 
128L, 45L, 117L, 340L, 287L, 436L, 165L, 406L, 122L, 175L, 441L, 
136L, 325L, 81L, 148L, 285L, 118L, 209L, 186L, 204L, 433L, 176L, 
153L, 420L, 182L, 380L, 150L, 195L, 379L, 455L, 143L, 391L, 379L, 
178L, 339L, 165L, 435L, 125L, 161L, 128L, 416L, 173L, 381L, 420L, 
130L, 333L, 153L, 151L, 79L, 246L, 428L, 147L, 330L, 161L, 217L, 
155L, 75L, 139L, 104L, 164L, 167L, 302L, 69L, 223L, 226L, 206L, 
331L, 330L, 485L, 237L, 111L, 291L, 386L, 64L, 155L, 170L)), .Names = c("TIME", 
"GAUGE_ID", "PH", "EH"), class = "data.frame", row.names = c(NA, 
-177L))

Jaime Caffarel · Answer 1 · 2016-12-27T19:42:57.180

1

If you just want to calculate the mean of a given column for every month, you can try with data.table, I think it would be easier.

library(data.table)
df <- as.data.table(df)
df[, .(meanByMonth = mean(PH)), by = month(TIME)][order(month)]

    month meanByMonth
 1:     1    7.221429
 2:     2    7.285000
 3:     3    7.284167
 4:     4    7.428571
 5:     5    7.255000
 6:     6    7.252632
 7:     7    7.220000
 8:     8    7.134444
 9:     9    7.299167
10:    10    7.102500
11:    11    7.048214
12:    12    7.250000

P.S. This cheat sheet is very useful.

edited Dec 27 '16 at 19:42

answered Dec 27 '16 at 18:26

Jaime Caffarel

2,401
4
30
42

It's amazing what `data.table` can do! I have to study it as soon as possible! – mattu Dec 27 '16 at 19:34
1

Yeah, it is. I have edited the answer with a cheat sheet that I find particularly useful when working with data.tables. I hope it helps. – Jaime Caffarel Dec 27 '16 at 19:44

lmo · Accepted Answer · 2016-12-28T12:36:14.477

You can use aggregate in base R and a little date formatting "magic" to calculate this.

For every month-year, you could use:

aggregate(PH~format(TIME, "%Y-%m"), data=df, FUN=mean)

which returns

   format(TIME, "%Y-%m")       PH
1                2000-07 7.220000
2                2000-12 7.070000
3                2001-01 7.020000
4                2001-08 7.045000
5                2001-11 6.880000
6                2002-03 6.960000
7                2002-07 7.220000
...
31               2010-05 7.207143
32               2010-11 7.096667
33               2013-04 7.528571
34               2014-03 7.268333

For every month, across years, you could use

aggregate(PH~format(TIME, "%m"), data=df, FUN=mean)

which returns

   format(TIME, "%m")       PH
1                  01 7.221429
2                  02 7.285000
3                  03 7.284167
4                  04 7.428571
5                  05 7.255000
6                  06 7.252632
7                  07 7.220000
8                  08 7.134444
9                  09 7.299167
10                 10 7.102500
11                 11 7.048214
12                 12 7.250000

As the column names may not be what you want you want, you could wrap aggregate in setNames and provide nicer column names or do so in the next line.

setNames(aggregate(PH~format(TIME, "%m"), data=df, FUN=mean), c("months", "meanPH"))

you can easily add more columns by including cbind into aggregate:

aggregate(cbind(PH, EH)~format(TIME, "%m"), data=df, FUN=mean)

alternatively, you could use the standard interface to aggregate, where a list in the second argument determines the grouping.

aggregate(df[c("PH", "EH")], list(format(df$TIME, "%m")), FUN=mean)

Is there a way to let aggregate return a `data.frame` with the pH means as well as the EH means (another column in the example data)? — mattu, Dec 28 '16 at 11:21
Otherwise it's clear how to do it via `cbind`: `cbind(aggregate(PH ~ format(TIME, "%Y-%m"), data=df, FUN=mean), (aggregate(EH ~ format(TIME, "%Y-%m"), data=df, FUN=mean)[,2])`. — mattu, Dec 28 '16 at 11:27
See my edit at the bottom. It incorporates `cbind` into the `aggregate` function using the formula interface to `aggregate` or shows an alternative to the formula interface. — lmo, Dec 28 '16 at 12:39

R: For a timespan of some years compute the mean of measured values in every calendary month

2 Answers2