
How can I aggregate data collected at 1-minute intervals into 5-minute averages?

DeviceTime         Concentration
6/20/2013 11:13       
6/20/2013 11:14
6/20/2013 11:15
6/20/2013 11:16
6/20/2013 11:17
6/20/2013 11:18
6/20/2013 11:19
6/20/2013 11:20
6/20/2013 11:21
6/20/2013 11:22
6/20/2013 11:23
6/20/2013 11:24
6/20/2013 11:25
6/20/2013 11:26
6/20/2013 11:27
6/20/2013 11:28

...

The result I want looks like this:

DeviceTime             Concentration
6/20/2013 11:15
6/20/2013 11:20
6/20/2013 11:25
6/20/2013 11:30
6/20/2013 11:35
...

Each 5-minute average is simply the mean of the concentration values over the preceding five minutes.

Axeman
Vicki1227
  • `cut` can work on date/time objects with breaks set as, say, "5 min"... – A5C1D2H2I1M1N2O1R2T1 Mar 05 '14 at 16:31
  • possible duplicate of [How to round a time?](http://stackoverflow.com/questions/17108515/how-to-round-a-time) – Jack Ryan Mar 05 '14 at 16:57
  • possible duplicate of [Aggregate values of 15 minute steps to values of hourly steps](http://stackoverflow.com/questions/17389533/aggregate-values-of-15-minute-steps-to-values-of-hourly-steps) – agstudy Mar 05 '14 at 16:59

3 Answers


If your data doesn't start on a nice 5-minute wall clock boundary (as shown in your sample data – 11:13), note that cut() will create breakpoints based on the first timestamp it finds. This probably isn't what we normally want. Indeed, your sample output indicates this is not what you want.

Here's what cut() does:

df <- read.table(header=TRUE, sep=",", stringsAsFactors=FALSE, text="
DeviceTime,Concentration
6/20/2013 11:13,1
6/20/2013 11:14,1
6/20/2013 11:15,2
6/20/2013 11:16,2
6/20/2013 11:17,2
6/20/2013 11:18,2
6/20/2013 11:19,2
6/20/2013 11:20,3
6/20/2013 11:21,3
6/20/2013 11:22,3
6/20/2013 11:23,3
6/20/2013 11:24,3
6/20/2013 11:25,4")
df$DeviceTime <- as.POSIXct(df$DeviceTime, format="%m/%d/%Y %H:%M")

cut(df$DeviceTime, breaks="5 min")
 [1] 2013-06-20 11:13:00 2013-06-20 11:13:00 2013-06-20 11:13:00
 [4] 2013-06-20 11:13:00 2013-06-20 11:13:00 2013-06-20 11:18:00
 [7] 2013-06-20 11:18:00 2013-06-20 11:18:00 2013-06-20 11:18:00
[10] 2013-06-20 11:18:00 2013-06-20 11:23:00 2013-06-20 11:23:00
[13] 2013-06-20 11:23:00

means <- aggregate(df["Concentration"], 
                   list(fiveMin=cut(df$DeviceTime, "5 mins")),
                   mean)
means
              fiveMin Concentration
1 2013-06-20 11:13:00      1.600000
2 2013-06-20 11:18:00      2.600000
3 2013-06-20 11:23:00      3.333333

Notice that the first row of means (the 11:13:00 entry) is the mean of the first 5 rows of df, which have times of 11:13 to 11:17 -- i.e., up until just before the next cut/break point of 11:18.
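If you want to stay with cut(), one possible workaround is to pass it explicit breakpoints that already sit on wall clock boundaries, instead of an interval string. This is only a sketch: the names step, first, last and wall_breaks are made up for illustration, the sample data is rebuilt inline, and the times are parsed in UTC purely to keep the result reproducible.

```r
# Sample data from above, rebuilt inline (UTC is an assumption for reproducibility)
df <- data.frame(
  DeviceTime = as.POSIXct("2013-06-20 11:13", tz = "UTC") + 60 * (0:12),
  Concentration = c(1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4)
)

# Breakpoints on 5-minute wall clock boundaries: the boundary at or before the
# first timestamp, up to the boundary just after the last timestamp
step  <- 5 * 60
first <- as.POSIXct(floor(as.numeric(min(df$DeviceTime)) / step) * step,
                    origin = "1970-01-01", tz = "UTC")
last  <- as.POSIXct((floor(as.numeric(max(df$DeviceTime)) / step) + 1) * step,
                    origin = "1970-01-01", tz = "UTC")
wall_breaks <- seq(first, last, by = step)

# With a vector of date-time breaks, cut() uses them as given rather than
# starting at the first timestamp; right = FALSE keeps intervals as [start, end)
df$fiveMin <- cut(df$DeviceTime, breaks = wall_breaks, right = FALSE)
means_wall <- aggregate(Concentration ~ fiveMin, df, mean)
means_wall
```

With these breaks the groups are 11:10-11:14, 11:15-11:19, 11:20-11:24 and 11:25-11:29, so the 11:13 and 11:14 rows are averaged on their own rather than together with 11:15-11:17.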

You'll get the same result with dplyr (i.e., @lukeA's answer) if you use cut():

df %>%
  group_by(DeviceTime = cut(DeviceTime, breaks="5 min")) %>%
  summarize(Concentration = mean(Concentration))
Source: local data frame [3 x 2]

           DeviceTime Concentration
1 2013-06-20 11:13:00      1.600000
2 2013-06-20 11:18:00      2.600000
3 2013-06-20 11:23:00      3.333333

The xts package seems to break by wall clock time:

require(xts)
df.xts <- xts(df$Concentration, df$DeviceTime)
means.xts <- period.apply(df.xts, endpoints(df.xts, "mins", k=5), mean)
means.xts
                    [,1]
2013-06-20 11:14:00    1
2013-06-20 11:19:00    2
2013-06-20 11:24:00    3
2013-06-20 11:25:00    4

The time values are always the last time entry found in the 5-min window. You can round the time index column up to the next 5-min boundary with align.time(), if you want to report the times of the end of the periods:

means.rounded <- align.time(means.xts, 5*60)
means.rounded
                    [,1]
2013-06-20 11:15:00    1
2013-06-20 11:20:00    2
2013-06-20 11:25:00    3
2013-06-20 11:30:00    4

You can also round down, if you want to report the times of the beginning of the periods. But you'll need to define your own function first (which I found on Cross Validated):

align.time.down = function(x,n) {
    index(x) = index(x) - n
    align.time(x,n)
}
means.rounded.down <- align.time.down(means.xts, 5*60)
means.rounded.down
                    [,1]
2013-06-20 11:10:00    1
2013-06-20 11:15:00    2
2013-06-20 11:20:00    3
2013-06-20 11:25:00    4

Another solution, which uses floor() rather than the xts package, is as follows:

df$DeviceTimeFloor <- as.POSIXct(floor(as.numeric(df$DeviceTime) / (5 * 60)) * (5 * 60), origin='1970-01-01')
meansFloor <- aggregate(Concentration ~ DeviceTimeFloor, df, mean)
meansFloor
      DeviceTimeFloor Concentration
1 2013-06-20 11:10:00             1
2 2013-06-20 11:15:00             2
3 2013-06-20 11:20:00             3
4 2013-06-20 11:25:00             4

I prefer to report the start time of the 5-minute interval, and floor() is good for this. If I were to report aggregates by hour, I would expect a timestamp of 2013-06-20 11:00:00 to contain data for the period 11:00:00 - 11:59:59, not 10:00:00 - 10:59:59.

If you prefer to report the end time of the intervals, ceiling() can be used instead of floor(). But note that timestamps 11:01 - 11:05 will be converted to (and hence grouped at) 11:05 by ceiling(). In contrast, floor() converts 11:00 - 11:04 to 11:00.

So floor() and ceiling() each group a different set of observations. The xts package will group the same set of observations as floor(), but it will report the timestamp of the last observation in each period.
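To make the difference concrete, here is a small side-by-side sketch of the two groupings on the sample data. The column names Start and End and the variables step and secs are just illustrative, and the times are built in UTC so the result does not depend on the local time zone.

```r
# Sample data from above, rebuilt inline (UTC is an assumption for reproducibility)
df <- data.frame(
  DeviceTime = as.POSIXct("2013-06-20 11:13", tz = "UTC") + 60 * (0:12),
  Concentration = c(1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4)
)

step <- 5 * 60                      # interval width in seconds
secs <- as.numeric(df$DeviceTime)   # seconds since 1970-01-01

# floor() labels each row with the start of its interval,
# ceiling() labels it with the end of its interval
df$Start <- as.POSIXct(floor(secs / step) * step,   origin = "1970-01-01", tz = "UTC")
df$End   <- as.POSIXct(ceiling(secs / step) * step, origin = "1970-01-01", tz = "UTC")

by_start <- aggregate(Concentration ~ Start, df, mean)  # 4 groups: 11:10, 11:15, 11:20, 11:25
by_end   <- aggregate(Concentration ~ End,   df, mean)  # 3 groups: 11:15, 11:20, 11:25
by_start
by_end
```

With floor(), the 11:15 row starts a new group; with ceiling(), it closes the group that began at 11:11. That is why the two sets of means differ even though both use 5-minute boundaries.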

Community
Mark Rajcok

Using the dplyr package, and assuming your data is stored in a data frame named df:

require(dplyr)
df %>%
  group_by(DeviceTime = cut(DeviceTime, breaks="5 min")) %>%
  summarize(Concentration = mean(Concentration))
David Arenburg
lukeA
  • The syntax looks a little bit complicated :) – agstudy Mar 05 '14 at 17:01
  • The layout is a bit odd in a comment, but you can simplify your code this way, lukeA: `df %.% group_by(DeviceTime = cut(DeviceTime, breaks="5 min")) %.% summarize(Concentration = mean(Concentration))` – Gianluca Mar 05 '14 at 17:08
  • `cut()` will break on 11:13, 11:18, etc. rather than 5-min wall clock breaks (11:15, 11:20, etc.) See my answer for a solution which cuts/breaks on wall clock boundaries using package xts. – Mark Rajcok Aug 08 '15 at 21:44

I'd say the easiest and cleanest way to do this is using the lubridate and dplyr packages.

library(lubridate)  # for working with dates
library(dplyr)      # for manipulating data

df$DeviceTime5min <- floor_date(df$DeviceTime, "5 mins")
df_5min <- df %>% group_by(DeviceTime5min) %>% summarize(mean(Concentration))

The only limitation is that it works just for interval widths that divide evenly into an hour, that is: 1, 2, 3, 4, 5, 6, 10, 12, 15, 20, 30, or 60 minutes. For these it works perfectly :-)
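If you do need a width that does not fit into an hour, one possible workaround (my assumption, not something from the lubridate docs) is the base R floor() approach shown in the earlier answer. It counts whole intervals from 1970-01-01 rather than from the top of each hour, so the bins keep a constant width across hour boundaries. A rough sketch with a 7-minute width; the variable names and the sample timestamps are made up for illustration:

```r
# Hypothetical 1-minute timestamps spanning an hour boundary (10:50 .. 11:19, UTC)
t <- as.POSIXct("2013-06-20 10:50", tz = "UTC") + 60 * (0:29)

step <- 7 * 60  # 7 minutes, which does not divide evenly into an hour

# Floor each timestamp to a multiple of `step` seconds since 1970-01-01
bin <- as.POSIXct(floor(as.numeric(t) / step) * step,
                  origin = "1970-01-01", tz = "UTC")

# Distinct bin starts and the gaps between them, in seconds
starts <- sort(unique(bin))
gaps   <- diff(as.numeric(starts))
gaps
```

Every gap comes out as 420 seconds, including the one that crosses 11:00. Note that the bin starts are aligned to the epoch, so they will not generally land on "round" clock times.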