3

I have recently posted a similar question here, which was a bit oversimplified, though. Thus here we go again:

Assume I have a dataframe (dput output below) with time series data of many different variables (5 in this example, many more in the real data):

          date          a  b  c  d  e
1  2009-10-01 00:00:00 10 20 30 40 50
2  2009-10-01 01:00:00 11 21 31 41 51
3  2009-10-01 02:00:00 12 22 32 42 52
4  2009-10-01 03:00:00 13 23 33 43 53
5  2009-10-01 04:00:00 14 24 34 44 54
6  2009-10-01 05:00:00 15 25 35 45 55
7  2009-10-01 06:00:00 16 26 36 46 56
8  2009-10-01 07:00:00 17 27 37 47 57
9  2009-10-01 08:00:00 18 28 38 48 58
10 2009-10-01 09:00:00 19 29 39 49 59
11 2009-10-01 10:00:00 20 30 40 50 60
12 2009-10-01 11:00:00 21 31 41 51 61
13 2009-10-01 12:00:00 22 32 42 52 62
14 2009-10-01 13:00:00 23 33 43 53 63
15 2009-10-01 14:00:00 24 34 44 54 64
16 2009-10-01 15:00:00 25 35 45 55 65
17 2009-10-01 16:00:00 26 36 46 56 66
18 2009-10-01 17:00:00 27 37 47 57 67
19 2009-10-01 18:00:00 28 38 48 58 68
20 2009-10-01 19:00:00 29 39 49 59 69
21 2009-10-01 20:00:00 30 40 50 60 70
22 2009-10-01 21:00:00 31 41 51 61 71
23 2009-10-01 22:00:00 32 42 52 62 72
24 2009-10-01 23:00:00 33 43 53 63 73
25 2009-10-02 00:00:00 34 44 54 64 74

and another data frame "events" with different time periods defined by a start and stop date (3 here, many more in the real data):

   id       start                stop
1 AGH 2009-10-01 02:00:00 2009-10-01 04:00:00
2 TRG 2009-10-01 03:00:00 2009-10-01 10:00:00
3 ZUH 2009-10-01 03:00:00 2009-10-01 20:00:00

I would like to get a table of the mean values of the variables within the different events like this:

   id avg(y.a) avg(y.b) avg(y.c) avg(y.d) avg(y.e)
1 AGH     13.0     23.0     33.0     43.0     53.0
2 TRG     16.5     26.5     36.5     46.5     56.5
3 ZUH     21.5     31.5     41.5     51.5     61.5

I have learned from my previous post that I can do this using the sqldf package and a rather simple SQL statement:

means <- sqldf("
+     SELECT x.id, avg(y.a), avg(y.b), avg(y.c), avg(y.d), avg(y.e) 
+     FROM events as x, data as y 
+     WHERE y.date between x.start and x.stop 
+     GROUP BY x.id 
+ ")

However, as the real data contains many more columns to average, which are named differently in the various files I have to process, typing all the column names into the SQL statements becomes a bit tedious.

Thus I would prefer a solution in R, where I can simply refer to the columns by their number (data[2:100]) The difficulty is, though, that the time periods are non-continous and overlapping and the ids are character strings.

Any ideas how to do this would be much appreciated!

dput(data)

structure(list(date = structure(c(1254348000, 1254351600, 1254355200, 
1254358800, 1254362400, 1254366000, 1254369600, 1254373200, 1254376800, 
1254380400, 1254384000, 1254387600, 1254391200, 1254394800, 1254398400, 
1254402000, 1254405600, 1254409200, 1254412800, 1254416400, 1254420000, 
1254423600, 1254427200, 1254430800, 1254434400), class = c("POSIXct", 
"POSIXt"), tzone = "Europe/Berlin"), a = 10:34, b = 20:44, c = 30:54, 
    d = 40:64, e = 50:74), .Names = c("date", "a", "b", "c", 
"d", "e"), row.names = c(NA, -25L), class = "data.frame")

dput(events)

structure(list(id = structure(1:3, .Label = c("AGH", "TRG", "ZUH"
), class = "factor"), start = structure(c(1254355200, 1254358800, 
1254358800), class = c("POSIXct", "POSIXt"), tzone = "Europe/Berlin"), 
    stop = structure(c(1254362400, 1254384000, 1254420000), class = c("POSIXct", 
    "POSIXt"), tzone = "Europe/Berlin")), .Names = c("id", "start", 
"stop"), row.names = c(NA, -3L), class = "data.frame")
Community
  • 1
  • 1
cin
  • 91
  • 6

2 Answers2

2
  1. The basic problem is due to the fact that the data is not normalized; however, short of putting it into long form we could dynamically generate the sql statement:

    library(sqldf)
    sql <- paste("select id, ", 
        toString(sprintf("avg(y.%s)", names(data)[-1])),
        "from events as x, data as y
        where y.date between x.start and x.stop
        group by x.id")
    sqldf(sql)
    
  2. As an alternative, we show the use of melt in the reshape2 package to convert the data to long form, data_long, process it to give means.long and convert it back to wide form using dcast :

    library(reshape2)
    data_long <- melt(data, id.vars = "date")
    means_long <- sqldf("
         SELECT x.id, y.variable, avg(value)
         FROM events as x, data_long as y 
         WHERE y.date between x.start and x.stop 
         GROUP BY x.id, y.variable
    ")
    means <- dcast(id ~ variable, data = means_long, value.var = "avg(value)")
    
A5C1D2H2I1M1N2O1R2T1
  • 190,393
  • 28
  • 405
  • 485
G. Grothendieck
  • 254,981
  • 17
  • 203
  • 341
  • I don't really understand why the data should be normalized, but sprintf is an interesting function. Thanks! – cin Jun 30 '12 at 18:04
  • Have added an example of converting it to long form (normalized) and processing it in that form and then converting back to wide form. – G. Grothendieck Jun 30 '12 at 20:06
  • @cin: I think the sqldf function is much more interesting than plain old `sprintf`. (G. Grothendieck is a much more capable R programmer than I and he wrote that package .... which needed to be loaded. – IRTFM Jun 30 '12 at 22:06
  • @G.Grothendieck: I do understand now. This is another nice approach. I am just beginning to discover the power of R. Thanks! – cin Jul 01 '12 at 10:29
  • @DWin: I knew about the sqldf package already. It's great, I agree. As I am not a programmer at all, though, I guess there will be many more interesting "plain old" functions for me to find. – cin Jul 01 '12 at 10:45
1
>  t( sapply(events$id , function(id) lapply (
            data[ data[["date"]] >= events[ events[['id']]==id, 'start'] & 
                  data[["date"]] <= events[ events[['id']]==id, 'stop' ] ,  -1 ], 
            mean) ) )
     a    b    c    d    e   
[1,] 13   23   33   43   53  
[2,] 16.5 26.5 36.5 46.5 56.5
[3,] 21.5 31.5 41.5 51.5 61.5
#  Or if you prefer:
t( sapply(events$id , function(id) data.frame( 
                       id=as.character(id), 
                       lapply (data[ data[["date"]] >= events[events[['id']]==id, 'start'] &  
                                     data[["date"]] <= events[ events[['id']]==id, 'stop' ] , -1 ],
                               mean) ,stringsAsFactors=FALSE) ) )
     id    a    b    c    d    e   
[1,] "AGH" 13   23   33   43   53  
[2,] "TRG" 16.5 26.5 36.5 46.5 56.5
[3,] "ZUH" 21.5 31.5 41.5 51.5 61.5
IRTFM
  • 258,963
  • 21
  • 364
  • 487
  • Thanks for your quick answer! Looks promising, though I will need some more time to try and fully understand it... Any chance I can have the id names in a first column like in my result table above? – cin Jun 29 '12 at 21:00
  • @cin convert the output to a data.frame et voila. – Roman Luštrik Jun 30 '12 at 07:43