4

I have a dataset filled with the average windspeed per hour for multiple years. I would like to create an 'average year', in which for each hour the average windspeed for that hour over multiple years is calculated. How can I do this without looping endlessly through the dataset? Ideally, I would like to just loop through the data once, extracting for each row the right month, day, and hour, and adding the windspeed from that row to the right row in a dataframe where the aggregates for each month, day, and hour are gathered. Is it possible to do this without extracting the month, day, and hour, and then looping over the complete average-year data.frame to find the right row?

Some example data:

data.multipleyears <- data.frame(
 DATETIME = c("2001-01-01 01:00:00", "2001-05-03 09:00:00", "2007-01-01 01:00:00", "2008-02-29 12:00:00"),
 Windspeed = c(10, 5, 8, 3)
)

Which I would like to aggregate in a dataframe like this:

average.year <- data.frame(
 DATETIME = c("01-01 00:00:00", "01-01 01:00:00", ..., "12-31 23:00:00")
 Aggregate.Windspeed = (100, 80, ...)
)

From there, I can go on calculating the averages, etc. I have probably overlooked some command, but what would be the right syntax for something like this (in pseudocode):

 for(i in 1:nrow(data.multipleyears) {
  average.year$Aggregate.Windspeed[
   where average.year$DATETIME(month, day, hour) == data.multipleyears$DATETIME[i](month, day, hour)]  <- average.year$Aggregate.Windspeed + data.multipleyears$Windspeed[i]
 }

Or something like that. Help is appreciated!

Paul Hiemstra
  • 59,984
  • 12
  • 142
  • 149
Gerrit Jan
  • 349
  • 4
  • 11

3 Answers3

3

I predict that ddply and the plyr package are going to be your best friend :). I created a 30 year dataset with hourly random windspeeds between 1 and 10 ms:

begin_date = as.POSIXlt("1990-01-01", tz = "GMT")
# 30 year dataset
dat = data.frame(dt = begin_date + (0:(24*30*365)) * (3600))
dat = within(dat, {
  speed = runif(length(dt), 1, 10)
  unique_day = strftime(dt, "%d-%m")
})
> head(dat)
                   dt unique_day    speed
1 1990-01-01 00:00:00      01-01 7.054124
2 1990-01-01 01:00:00      01-01 2.202591
3 1990-01-01 02:00:00      01-01 4.111633
4 1990-01-01 03:00:00      01-01 2.687808
5 1990-01-01 04:00:00      01-01 8.643168
6 1990-01-01 05:00:00      01-01 5.499421

To calculate the daily normalen (30 year average, this term is much used in meteorology) over this 30 year period:

library(plyr)
res = ddply(dat, .(unique_day), 
            summarise, mean_speed = mean(speed), .progress = "text")
> head(res)
  unique_day mean_speed
1      01-01   5.314061
2      01-02   5.677753
3      01-03   5.395054
4      01-04   5.236488
5      01-05   5.436896
6      01-06   5.544966

This takes just a few seconds on my humble two core AMD, so I suspect just going once through the data is not needed. Multiple of these ddply calls for different aggregations (month, season etc) can be done separately.

Paul Hiemstra
  • 59,984
  • 12
  • 142
  • 149
  • Awesome! I used your example and got the right result. Now I'm going to try to actually understand your code... I heard of the plyr package, but couldn't find out how to use it. Thanks! – Gerrit Jan Apr 04 '12 at 09:32
  • Take a look at www.jstatsoft.org/v40/i01 for paper on plyr. It explains plyr very well. – Paul Hiemstra Apr 04 '12 at 09:56
2

You can use substr to extract the part of the date you want, and then use tapply or ddply to aggregate the data.

tapply(
  data.multipleyears$Windspeed, 
  substr( data.multipleyears$DATETIME, 6, 19), 
  mean 
)
# 01-01 01:00:00 02-29 12:00:00 05-03 09:00:00 
#              9              3              5 

library(plyr)
ddply(
  data.multipleyears, 
  .(when=substr(DATETIME, 6, 19)), 
  summarize, 
  Windspeed=mean(Windspeed)
)
#             when Windspeed
# 1 01-01 01:00:00         9
# 2 02-29 12:00:00         3
# 3 05-03 09:00:00         5
Vincent Zoonekynd
  • 31,893
  • 5
  • 69
  • 78
  • 1
    I am really fond of converting DATETIME to a POSIXlt object and than creating a new column using strftime. See my example below. – Paul Hiemstra Apr 04 '12 at 09:17
  • Thanks for the answer, Vincent. I actually used Paul's answer as my solution, but I guess yours will work just as fine -- but I'd still have to loop over all of the dateparts, right? That is not necessary in Paul's answer. – Gerrit Jan Apr 04 '12 at 09:33
  • No, no loop is needed. The only difference with Paul's answer is that he used POSIXct objects (that is cleaner), while I used strings. – Vincent Zoonekynd Apr 04 '12 at 09:43
0

It is pretty old post, but I wanted to add. I guess timeAverage in Openair can also be used. In the manual, there are more options for timeAverage function.

Community
  • 1
  • 1
Çelik Termos
  • 41
  • 1
  • 1
  • 4