2

assume I have a data frame "data" with time series data of a measured variable x:

     date            x
2009/10/01 00:00    10
2009/10/01 01:00    11
2009/10/01 02:00    12
2009/10/01 03:00    13
2009/10/01 04:00    14
2009/10/01 05:00    15
2009/10/01 06:00    16
2009/10/01 07:00    17
2009/10/01 08:00    18
2009/10/01 09:00    19
2009/10/01 10:00    20
2009/10/01 11:00    21
2009/10/01 12:00    22
2009/10/01 13:00    23
2009/10/01 14:00    24
2009/10/01 15:00    25
2009/10/01 16:00    26
2009/10/01 17:00    27
2009/10/01 18:00    28
2009/10/01 19:00    29
2009/10/01 20:00    30
2009/10/01 21:00    31
2009/10/01 22:00    32
2009/10/01 23:00    33
2009/10/02 00:00    34
...

and another data frame "events" with different time periods defined by a start and stop date:

id        start              stop
1   2009/10/01 02:00    2009/10/01 04:00
2   2009/10/01 07:00    2009/10/01 10:00
3   2009/10/01 08:00    2009/10/01 20:00
...

Now I would like to get a table of the mean values of x within the different events like this:

id  mean.x
1   13
2   18.5
3   25.5

In a database I do a simple SQL statement like this:

SELECT a.id, avg(b.x) 
FROM events as a, data as b 
WHERE b.date between a.start and a.stop 
GROUP BY a.id 

I wonder how I could do such averaging in R? I could use 'aggregate' if I had an id column in data indicating which data point falls within wich event, but I cannot find a way to create this column...

Any suggestions would be much appreciated.

cin

Edit:

dput(data):

structure(list(date = structure(c(1254348000, 1254351600, 1254355200, 
1254358800, 1254362400, 1254366000, 1254369600, 1254373200, 1254376800, 
1254380400, 1254384000, 1254387600, 1254391200, 1254394800, 1254398400, 
1254402000, 1254405600, 1254409200, 1254412800, 1254416400, 1254420000, 
1254423600, 1254427200, 1254430800, 1254434400), class = c("POSIXct", 
"POSIXt"), tzone = "Europe/Berlin"), x = 10:34), .Names = c("date", 
"x"), row.names = c(NA, -25L), class = "data.frame")

dput(events):

structure(list(id = 1:3, start = structure(c(1254355200, 1254373200, 
1254387600), class = c("POSIXct", "POSIXt"), tzone = "Europe/Berlin"), 
    stop = structure(c(1254362400, 1254384000, 1254420000), class = c("POSIXct", 
    "POSIXt"), tzone = "Europe/Berlin")), .Names = c("id", "start", 
"stop"), row.names = c(NA, -3L), class = "data.frame")

Edit2:

dput(events2):

structure(list(id = structure(1:3, .Label = c("AGH", "TRG", "ZUH"
), class = "factor"), start = structure(c(1254355200, 1254358800, 
1254358800), class = c("POSIXct", "POSIXt"), tzone = "Europe/Berlin"), 
    stop = structure(c(1254362400, 1254384000, 1254420000), class = c("POSIXct", 
    "POSIXt"), tzone = "Europe/Berlin")), .Names = c("id", "start", 
"stop"), row.names = c(NA, -3L), class = "data.frame")
cin
  • 91
  • 6
  • Welcome to StackOverflow! To make your question easier to answer, I suggest using `dput(data)` and `dput(events)` to create a reproducible version of your dataframes (or perhaps just part of them, say the first 30 rows). There is more great advice [here](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) – David Robinson Jun 20 '12 at 15:37
  • Thanks, I added the output of dput(). – cin Jun 20 '12 at 15:58

2 Answers2

4

Try this:

library(sqldf)
sqldf("
    SELECT a.id, avg(b.x) 
    FROM events as a, data as b 
    WHERE b.date between a.start and a.stop 
    GROUP BY a.id 
")
G. Grothendieck
  • 254,981
  • 17
  • 203
  • 341
  • Wow, works like a charm, even for the "complication" I've added above. Feels like cheating, though ;-) And I'm not learning R from it... :-( Thanks anyway! Interesting package. – cin Jun 20 '12 at 19:11
1

You can get a list of what event each date falls into like this:

event.indices = sapply(data$date, function(d) which(d >= events$start & d <= events$stop))

The problem is that some of them will be null if they don't fall into any event. So you replace those with NA's and you're on your way:

event.indices = sapply(event.indices, function(i) ifelse(is.null(i), NA, i))

On your data, this is:

  [1] NA NA  1  1  1 NA NA  2  2  2  2  3  3  3  3  3  3  3  3  3  3 NA NA NA NA

You can then add that to your original data frame, and use aggregate just as you originally suggested.

NOTE: This is assuming that each date is guaranteed to fall into at most one event. If the events overlap (even if one begins precisely when the last one ends), you'll have to decide what to do with dates that fall under multiple events and adjust appropriately.

David Robinson
  • 77,383
  • 16
  • 167
  • 187
  • Thanks, looks pretty easy, actually. Guess I'll have to learn thinkint R-way... If I might, I would like to add two complications (as I realized I've actually oversimplified my example): 1. What if the event ids are character strings? Is there a way to output these to event.indices? 2. What if the events overlap like in dput(events2) I have just added? – cin Jun 20 '12 at 18:55