
I have been struggling with this for a while. I am new to working with ts data and all the related R packages. I have a df with several variables, including the 'time of day' in GMT ("%H%M") and the date ("%Y/%m/%e") on which sampling occurred. I want to bin/aggregate my date data into "weeks" (i.e., %W/%g) and calculate the mean 'time of day' at which sampling occurred during each week.

I was able to calculate other FUN on numerical variables (e.g., weight) by first transforming my df into a zoo object and then using the aggregate.zoo command as follows:

#calculate the sum weight captured every week 
x2c <- aggregate(OA_zoo, as.Date(cut(time(OA_zoo), "week")), sum)

However, I am not sure how to get around the fact that I am working with a Date format rather than num, and I would appreciate any tips! Also, I have obviously been coding way too much by handling each of my variables separately. Would there be a way of applying different FUN (sum/mean/max/min) to my df by aggregating "weekly" using plyr? Or some other package?

EDITS/CLARIFICATIONS: Here's the dput output of a sample of my full dataset. I have data from 2004-2011. What I would like to look at/plot using ggplot2 is the mean/median of TIME (%H%M) aggregated into weekly periods over time (2004-2011). Right now, my data is not aggregated by week; it is daily (a random sample).

> dput(godin)
structure(list(depth = c(878, 1200, 1170, 936, 942, 964, 951, 
953, 911, 969, 960, 987, 991, 997, 1024, 978, 1024, 951, 984, 
931, 1006, 929, 973, 986, 935, 989, 1042, 1015, 914, 984), duration = c(0.8, 
2.6, 6.5, 3.2, 4.1, 6.4, 7.2, 5.3, 7.4, 7, 7, 5.5, 7.5, 7.3, 
7.5, 7, 4.2, 3, 5, 5, 9.3, 7.9, 7.3, 7.2, 7, 5.2, 8, 6, 7.5, 
7), Greenland = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 40L, 28L, 0L, 
0L, 34L, 7L, 28L, 0L, 0L, 0L, 27L, 0L, 0L, 0L, 44L, 59L, 0L, 
0L, 0L, 0L, 0L, 0L), date2 = structure(c(12617, 12627, 12631, 
12996, 12669, 13036, 12669, 13036, 12670, 13036, 12670, 13037, 
12671, 13037, 12671, 13037, 12671, 13038, 12672, 13038, 12672, 
13038, 12672, 13039, 12631, 12997, 12673, 13039, 12673, 13039
), class = "Date"), TIME = c("0940", "0145", "0945", "2045", 
"1615", "0310", "2130", "1045", "0625", "1830", "1520", "0630", 
"0035", "1330", "0930", "2215", "2010", "0645", "0155", "1205", 
"0815", "1845", "2115", "0350", "1745", "0410", "0550", "1345", 
"1515", "2115")), .Names = c("depth", "duration", "Greenland", 
"date2", "TIME"), class = "data.frame", row.names = c("6761", 
"9019", "9020", "9021", "9022", "9023", "9024", "9025", "9026", 
"9027", "9028", "9029", "9030", "9031", "9032", "9033", "9034", 
"9035", "9036", "9037", "9038", "9039", "9040", "9041", "9042", 
"9043", "9044", "9045", "9046", "9047"))
GodinA

3 Answers


I'd approach it like this: first make a column with a string representing the week:

godin$week <- format(godin$date2, "%Y-W%U")

This will give you something like "2004-W26", which is good enough for aggregate.

Then you need to turn the character vector that represents HHMM into an actual time, so that you can do time math on it.

godin$time2 <- as.POSIXct(strptime(godin$TIME, "%H%M"))

NOTE: the above is a bit of a hack. strptime() assumes the current date if none is specified, but that shouldn't get in the way of this particular application: since all converted times get the same date, the time part of the mean will still be correct. I'll strip off the date later.
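
A quick illustration of that behaviour (the date part shown is hypothetical; it will be whatever day you happen to run this on):

as.POSIXct(strptime("0940", "%H%M"))
# e.g. "2012-07-17 09:40:00" -- the clock time is kept, the date is just "today"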

At that point, I think you can simply aggregate:

x2c <- aggregate(time2~week, data=godin, FUN=mean)

and get rid of the irrelevant (and erroneous) date part

x2c$time2 <- format(x2c$time2,"%H:%M:%S")

Et voilà.

> x2c
      week    time2
1 2004-W29 09:40:00
2 2004-W30 01:45:00
3 2004-W31 13:45:00
4 2004-W36 12:07:00
5 2004-W37 10:32:30
6 2005-W31 12:27:30
7 2005-W36 10:48:20
8 2005-W37 13:11:06

The lesson here is that it's tricky to push around times with no associated dates in R. I'd love to hear from others who have a better way of doing this.
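
For what it's worth, one possible alternative (not part of the original answer): the chron package has a "times" class that stores a clock time as a fraction of a day with no date attached, so the mean can be taken without any dummy date. A minimal sketch, assuming chron is installed and using a hypothetical helper:

library(chron)

# hypothetical helper: "0940" -> a chron "times" value of 09:40:00
hhmm_to_times <- function(x) {
  chron::times(paste0(substr(x, 1, 2), ":", substr(x, 3, 4), ":00"))
}

godin$ctime <- hhmm_to_times(godin$TIME)

# reuse the week label created above, average the day-fractions, convert back
wk <- aggregate(list(mean_time = as.numeric(godin$ctime)),
                list(week = godin$week), mean)
wk$mean_time <- chron::times(wk$mean_time)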

mac
  • @JoshuaUlrich: bah. you're right. How to convert a time with no date? – mac Jul 17 '12 at 17:58
  • edited using `strptime()` instead. still a bit of a hack, see answer, help if you can, thanks! – mac Jul 17 '12 at 18:09
  • `strptime` returns `POSIXlt` which is a list, so that's not going to work. Wrapping `as.POSIXct` around it would at least avoid the error, but I don't think the results will make sense. – GSee Jul 17 '12 at 18:22
  • @GSee: Thanks! I found this out the hard way as I was testing. Because of the specific application, the "nonsense" date part of the result doesn't get in the way. My newly edited response is tested and works...finally! – mac Jul 17 '12 at 18:29

You need to convert the TIME column into common units (i.e. minutes). Here are a couple helper functions to do that:

hour2min <- function(hhmm) {
  # "0940" -> 9*60 + 40 = 580 minutes past midnight
  hhmm <- as.numeric(hhmm)
  trunc(hhmm/100)*60 + hhmm %% 100
}
min2hour <- function(min) {
  # 580 minutes -> 940 (HHMM); fractional minutes stay after the decimal point
  min <- as.numeric(min)
  trunc(min/60)*100 + min %% 60
}

Then you can aggregate the minutes however you'd like. I would personally put them in an xts object and use apply.weekly:

library(xts)
x <- xts(hour2min(godin$TIME), as.Date(godin$date2), dimnames=list(NULL,"MINS"))
w <- apply.weekly(x, mean)
w$TIME <- min2hour(w$MINS)
#                MINS     TIME
# 2004-07-18 580.0000  940.000
# 2004-08-01 585.0000  945.000
# 2004-09-12 711.2500 1151.250
# 2005-08-02 747.5000 1227.500
# 2005-09-11 746.6667 1226.667
# 2005-09-13 764.1667 1244.167

Seconds are still in 100ths of a minute, but could be changed...
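
For example, one way that change could look (a sketch, not from the original answer; it moves to a plain data.frame because xts columns must stay numeric):

# convert the fractional minutes into an "HH:MM:SS" string
wdf <- data.frame(week_ending = index(w), MINS = coredata(w)[, "MINS"])
wdf$TIME <- sprintf("%02d:%02d:%02d",
                    wdf$MINS %/% 60,              # whole hours
                    trunc(wdf$MINS) %% 60,        # whole minutes
                    round((wdf$MINS %% 1) * 60))  # fraction of a minute -> seconds
# e.g. 711.25 minutes becomes "11:51:15"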

Joshua Ulrich
  • Thanks for your answer @Joshua Ulrich, but this wasn't exactly what I was looking for. See answers below. – GodinA Jul 17 '12 at 20:16

If you want the average of all observation times per week and you want dates to matter (i.e. 19:00 today and 19:00 tomorrow would average to 07:00 tomorrow morning), then you can do this:

godin$datetime <- as.POSIXct(paste(godin$date2, godin$TIME), format="%Y-%m-%d %H%M")
aggregate(godin$datetime, list(format(godin$datetime, "%W/%g")), mean)

#  Group.1                   x
#1   28/04 2004-07-18 09:40:00
#2   30/04 2004-07-31 01:45:00
#3   31/05 2005-08-02 00:27:30
#4   36/04 2004-09-10 13:51:15
#5   36/05 2005-09-11 00:26:40
#6   37/05 2005-09-13 00:44:10
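
A quick sanity check of that date-aware averaging (the two timestamps below are made-up examples, not taken from the data):

mean(as.POSIXct(c("2004-07-18 19:00", "2004-07-19 19:00")))
# [1] "2004-07-19 07:00:00" -- the midpoint lands at 07:00 the next morning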

However, I think you want the average time, and you don't care about the date (other than for determining which week it is). In that case, you can use an arbitrary Date as your anchor, and treat all times like they occur on that Date.

godin$stime <- as.POSIXct(paste("1970-01-01", godin$TIME), format='%Y-%m-%d %H%M')
aggregate(godin$stime, list(format(godin$datetime, "%W/%g")), mean)

#  Group.1                   x
#1   28/04 1970-01-01 09:40:00
#2   30/04 1970-01-01 09:45:00
#3   31/05 1970-01-01 12:27:30
#4   36/04 1970-01-01 11:51:15
#5   36/05 1970-01-01 12:26:40
#6   37/05 1970-01-01 12:44:10

Edit

The difference between the above output and the output that @JoshuaUlrich provided is that his is sorted by week. If I sort them, it's like Joshua's, but with seconds instead of 100ths of minutes. (Although, it is a little tricky to sort them the same way because of the format specification that you chose.)

out <- aggregate(godin$stime, list(format(godin$datetime, "%W/%g")), mean)
out[order(as.numeric(paste0(substr(out[, 1], 4, 5), substr(out[, 1], 1, 2)))), ]
#  Group.1                   x
#1   28/04 1970-01-01 09:40:00
#2   30/04 1970-01-01 09:45:00
#4   36/04 1970-01-01 11:51:15
#3   31/05 1970-01-01 12:27:30
#5   36/05 1970-01-01 12:26:40
#6   37/05 1970-01-01 12:44:10

Edit 2

You can format inside aggregate if you just want to get back a string formatted like %H%M:

out <- aggregate(godin$stime, list(format(godin$datetime, "%W/%g")), function(TIME) format(mean(TIME), "%H%M"))
out[order(as.numeric(paste0(substr(out[, 1], 4, 5), substr(out[, 1], 1, 2)))), ]
#  Group.1    x
#1   28/04 0940
#2   30/04 0945
#4   36/04 1151
#3   31/05 1227
#5   36/05 1226
#6   37/05 1244
GSee
  • @GodinA, don't let my base R answer fool you; If I were personally doing this, I would convert my data to `xts` before doing anything. – GSee Jul 17 '12 at 19:26
  • Curious to know, why your results differ from @mac? – GodinA Jul 17 '12 at 20:17
  • Because I used `%W/%g` like you asked ;-). Note that, for example, @mac treats 2004-07-28 as a different week than 2004-08-01 – GSee Jul 17 '12 at 20:22
  • If you replace `godin$week <- format(godin$date2, "%Y-W%U")` with `godin$week <- format(godin$date2, "%W/%g")` in @mac's answer, our results match – GSee Jul 17 '12 at 20:29