5

I am wondering if there is a way to average daily data into weekly data. The dataframe that I call CADaily looks like this:

      > CADaily[1:10, ]
          Climate_Division       Date      Rain
      885                1 1948-07-01 0.8750000
      892                1 1948-07-02 2.9166667
      894                1 1948-07-03 0.7916667
      895                1 1948-07-04 0.4305556
      898                1 1948-07-05 0.8262061
      901                1 1948-07-06 0.5972222
      904                1 1948-07-17 0.04166667
      905                1 1948-07-18 0.08333333
      907                1 1948-07-20 0.04166667
      909                1 1948-07-22 0.12500000
      910                1 1948-07-21 NA

My objective is similar to the aggregate function to find the average of the daily rain into weekly rain values base on the Date (of course) and the Climate_Division (ranges from 1 to 7). I was searching online and I came across a code that I was able to use but wasn't quite to par to my objective:

      apply.weekly(xts(CADaily[,-2], order.by= CADaily[,2]), FUN = mean)

This does what I would like it to do, however my column Climate_Division is also averaged. I would simply like to average Rain only and order it based on the Climate_Division followed by Date. Is there a way that I can possibly do it as follows:

      aggregate(CADaily, by =list(CADaily$Climate_Division, CADaily$Date), FUN = mean, na.rm = TRUE)

where Date is in some form of weeks? Or is there another way?


EDIT:

Dear All,

Thank you for your help. Perhaps using aggregate wasn't the best way to go about this as I originally thought. In terms of output, I wanted to obtain the weekly average of rain throughout the years of the data (1948 - 1995). In other words, I wanted to get a nice format that I can input into a time series that has the form of the date of the end of the week. The output that I am looking for (keeping in mind that there may exist NA values) is:

      Climate_Division     Date          Rain
      1                    1948-07-03    1.527778
      1                    1948-07-10    0.6179946
      1                    1948-07-17    0.04166667
      1                    1948-07-24    0.08333333
      ...
      1                    1995-12-23    0.24513245
      1                    1995-12-30    0.12450545

Or is there a better way of expressing weekly data that is represented by dates?

Thank you for your help.

Luciano Rodriguez
  • 719
  • 4
  • 9
  • 16

4 Answers4

4

Updated answer

Based on the OP's update on the request, I modified the code to aggregate the data over the date of a defined day of each week (Saturday). This time I only use functions available in base R. It ignores NAs (if there are only NAs for a given End_of_Week-Climate_Division you get NaN, not a number).

# Data with another Climate division as example (same daily values and dates)
CADaily <-
structure(list(Climate_Division = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2), Date = structure(c(1L, 2L, 
3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 
8L, 9L, 10L), .Label = c("01/07/1948", "02/07/1948", "03/07/1948", 
"04/07/1948", "05/07/1948", "06/07/1948", "17/07/1948", "18/07/1948", 
"20/07/1948", "22/07/1948"), class = "factor"), Rain = c(0.875, 
2.9166667, 0.7916667, 0.4305556, 0.8262061, 0.5972222, 0.04166667, 
0.08333333, 0.04166667, 0.125, 0.875, 2.9166667, 0.7916667, 0.4305556, 
0.8262061, 0.5972222, 0.04166667, 0.08333333, 0.04166667, 0.125
), week = c(27, 27, 27, 27, 27, 27, 29, 29, 29, 30, 27, 27, 27, 
27, 27, 27, 29, 29, 29, 30)), .Names = c("Climate_Division", 
"Date", "Rain", "week"), row.names = c(NA, 20L), class = "data.frame")

# Coerce to Date class
CADaily$Date <- as.Date(x=CADaily$Date, format='%d/%m/%Y')

# Extract day of the week (Saturday = 6)
CADaily$Week_Day <- as.numeric(format(CADaily$Date, format='%w'))

# Adjust end-of-week date (first saturday from the original Date)
CADaily$End_of_Week <- CADaily$Date + (6 - CADaily$Week_Day)

# Aggregate over week and climate division
aggregate(Rain~End_of_Week+Climate_Division, FUN=mean, data=CADaily, na.rm=TRUE)

# Output
#   End_of_Week Climate_Division       Rain
# 1  1948-07-03                1 1.52777780
# 2  1948-07-10                1 0.61799463
# 3  1948-07-17                1 0.04166667
# 4  1948-07-24                1 0.08333333
# 5  1948-07-03                2 1.52777780
# 6  1948-07-10                2 0.61799463
# 7  1948-07-17                2 0.04166667
# 8  1948-07-24                2 0.08333333

Additional operations

Also, using this code you can obtain results from additional aggregation functions, assuming the result is an atomic vector of the same length for every week-division pair.

# Aggregate over week and climate division, and show the total number of
# observations per week, the number of observations which represent missing
# values, the average, and the standard deviation.
aggregate(Rain~End_of_Week+Climate_Division, data=CADaily,
          FUN=function(x) c(n=length(x),
                            NAs=sum(is.na(x)),
                            Average=mean(x, na.rm=TRUE),
                            SD=sd(x, na.rm=TRUE)))

# Output. You get NA for the standard deviation if there is only one observation.
#   End_of_Week Climate_Division     Rain.n   Rain.NAs Rain.Average    Rain.SD
# 1  1948-07-03                1 3.00000000 0.00000000   1.52777780 1.20353454
# 2  1948-07-10                1 3.00000000 0.00000000   0.61799463 0.19864151
# 3  1948-07-17                1 1.00000000 0.00000000   0.04166667         NA
# 4  1948-07-24                1 3.00000000 0.00000000   0.08333333 0.04166667
# 5  1948-07-03                2 3.00000000 0.00000000   1.52777780 1.20353454
# 6  1948-07-10                2 3.00000000 0.00000000   0.61799463 0.19864151
# 7  1948-07-17                2 1.00000000 0.00000000   0.04166667         NA
# 8  1948-07-24                2 3.00000000 0.00000000   0.08333333 0.04166667



Original answer

Try with the lubridate package. Load it, and then aggregate (kept for the record as part of the original answer, which reflected the OP's request to aggregate by week).

# Load lubridate package
library(package=lubridate)

# Set Weeks number. Date already of class `Date`
CADaily$Week <- week(CADaily$Date)

# Aggregate over week number and climate division
aggregate(Rain~Week+Climate_Division, FUN=mean, data=CADaily, na.rm=TRUE)

# Output
#   Week Climate_Division       Rain
# 1   27                1 1.07288622
# 2   29                1 0.05555556
# 3   30                1 0.12500000
# 4   27                2 1.07288622
# 5   29                2 0.05555556
# 6   30                2 0.12500000
Community
  • 1
  • 1
Oscar de León
  • 2,331
  • 16
  • 18
  • Dear Oscar, Thank you for your help. Perhaps using aggregate wasn't the best way to go about this as I originally thought. In terms of output, I wanted to obtain the weekly average of rain throughout the years of the data (1948 - 1995). In other words, I wanted to get a nice format that I can input into a time series that has the form of the date of the end of the week. For example: – Luciano Rodriguez Feb 27 '13 at 08:07
  • Sorry, for not fitting everything in the above comment, I'll edit it on the main question. – Luciano Rodriguez Feb 27 '13 at 08:14
  • Ah, using the end of the week instead of the week number of the year is easy, and we will dispense of using non-base package. Basically you should get the weekday for each date, and add to the date the number of days left until the Saturday of that week. I'll update the answer when I get to work later. – Oscar de León Feb 27 '13 at 10:26
  • Perfect! Does what I would like. Thank you for all your help Oscar. – Luciano Rodriguez Feb 27 '13 at 20:14
  • 1
    @OscardeLeón, with respect to your original answer I think it should be `week` not `Week` in the `CADaily$Week <- Week(CADaily$Date)`. – Konrad Apr 20 '15 at 12:45
1

xts is great for such manipulations. Use endpoints to subset data, then sapply to treat it weekly.

CADaily <- read.table(text ='     Climate_Division       Date      Rain
      885                1 1948-07-01 0.8750000
      892                1 1948-07-02 2.9166667
      894                1 1948-07-03 0.7916667
      895                1 1948-07-04 0.4305556
      898                1 1948-07-05 0.8262061
      901                1 1948-07-06 0.5972222
      904                1 1948-07-17 0.04166667
      905                1 1948-07-18 0.08333333
      907                1 1948-07-20 0.04166667
      909                1 1948-07-22 0.12500000',head=T)
dat.xts <- xts(CADaily[,-2], order.by= as.POSIXct(CADaily[,2]))
INDEX <- endpoints(dat.xts, 'weeks')

lapply(1:(length(INDEX) - 1), function(y) {
    y <- dat.xts[(INDEX[y] + 1):INDEX[y + 1]]
    data.frame(y$Climate_Division,mean(y$Rain))

  })

My result is a list by week:

[[1]]
           Climate_Division mean.y.Rain.
1948-07-01                1     1.168019
1948-07-02                1     1.168019
1948-07-03                1     1.168019
1948-07-04                1     1.168019
1948-07-05                1     1.168019

[[2]]
           Climate_Division mean.y.Rain.
1948-07-06                1    0.5972222

[[3]]
           Climate_Division mean.y.Rain.
1948-07-17                1       0.0625
1948-07-18                1       0.0625

[[4]]
           Climate_Division mean.y.Rain.
1948-07-20                1   0.08333334
1948-07-22                1   0.08333334
agstudy
  • 119,832
  • 17
  • 199
  • 261
  • This is great, however, when I ran the full data using your algorithm I am across many warnings: Warning messages: 1: In data.row.names(row.names, rowsi, i) : some row.names duplicated: 2,3,4,5,6,7,9,12,13,15,16 --> row.names NOT used using the command tail(x), where x is what I set equal to what is returned by the function lapply, I don't obtain the dates when all divisions are present at a certain date: – Luciano Rodriguez Feb 27 '13 at 08:32
  • [[140]] Climate_Division mean.y.Mean..na.rm...TRUE. 1948-05-28 17:00:00 2 4.020833 1948-05-29 17:00:00 2 4.020833 [[141]] Climate_Division mean.y.Mean..na.rm...TRUE. 1948-05-31 17:00:00 2 0.02083333 1948-06-05 17:00:00 2 0.02083333 – Luciano Rodriguez Feb 27 '13 at 08:33
  • [[142]] Climate_Division mean.y.Mean..na.rm...TRUE. 1 1 0.486447 2 2 0.486447 3 3 0.486447 4 4 0.486447 5 5 0.486447 6 6 0.486447 7 7 0.486447 8 1 0.486447 9 2 0.486447 10 1 0.486447 11 1 0.486447 – Luciano Rodriguez Feb 27 '13 at 08:33
  • @LucianoRodriguez I am lost !:) – agstudy Feb 27 '13 at 08:47
  • @ agstudy, sorry for getting you lost, I am not quite familiar on how to make the text nice within a comment. But the output that I am looking for is in the edited version of my main question. Let me know if that is still confusing : / – Luciano Rodriguez Feb 27 '13 at 08:52
  • I simply want to get the average weekly (say at the end of the week on Saturday) if the data is available and may contain NA values for all weeks from 1948 - 1995. – Luciano Rodriguez Feb 27 '13 at 08:54
1

just do:

library(tidyverse)
library(lubridate)

df <- df %>% 
  group_by(week = week(Date)) %>% #make sure 'Date' is a Date.object
  mutate("rain_mean" = mean(Rain))
arnle
  • 480
  • 4
  • 10
  • Please provide additional details in your answer. As it's currently written, it's hard to understand your solution. – Community Sep 02 '21 at 04:23
  • My understanding is that this solution would not work properly if the range of `Date` exceeds a solar year. To clarify, `week(ymd("2011-01-07")) == week(ymd("2012-01-07"))` returns `TRUE`. Is it possible to tweak your solution to accommodate this? – riccardo-df Nov 06 '22 at 19:49
0

I backtrack from my previous answer. I think this one is much simpler.

You just need to find what is coming weekend date for each row, and then aggregate

CADaily <- read.table(text = "Climate_Division       Date      Rain\n1 1948-07-01 0.8750000\n1 1948-07-02 2.9166667\n1 1948-07-03 0.7916667\n1 1948-07-04 0.4305556\n1 1948-07-05 0.8262061\n1 1948-07-06 0.5972222\n1 1948-07-17 0.04166667\n1 1948-07-18 0.08333333\n1 1948-07-20 0.04166667\n1 1948-07-22 0.12500000\n2 1948-07-01 0.8750000\n2 1948-07-02 2.9166667\n2 1948-07-03 0.7916667\n2 1948-07-04 0.4305556\n2 1948-07-05 0.8262061\n2 1948-07-06 0.5972222\n2 1948-07-17 0.04166667\n2 1948-07-18 0.08333333\n2 1948-07-20 0.04166667\n2 1948-07-22 0.12500000", 
    head = T)

CADaily$weekend <- as.POSIXlt(CADaily$Date) + (7 - as.POSIXlt(CADaily$Date)$wday) * 24 * 60 * 60

aggregate(Rain ~ weekend + Climate_Division, data = CADaily, FUN = mean)
##      weekend Climate_Division       Rain
## 1 1948-07-04                1 1.52777780
## 2 1948-07-11                1 0.61799463
## 3 1948-07-18                1 0.04166667
## 4 1948-07-25                1 0.08333333
## 5 1948-07-04                2 1.52777780
## 6 1948-07-11                2 0.61799463
## 7 1948-07-18                2 0.04166667
## 8 1948-07-25                2 0.08333333
CHP
  • 16,981
  • 4
  • 38
  • 57
  • Dear geektrader, Thank you for you input. However, looking at your output, the results are separated by division but they are not weekly values. For example, the first output is on 1948-07-05, while the second one is the next day 1948-07-06. So this is not necessarily weekly. Is there a way to do the average on Saturday of every week possibly? – Luciano Rodriguez Feb 27 '13 at 08:49
  • @LucianoRodriguez I have edited my answer do it in much simpler way working directly with data.frame itself – CHP Feb 27 '13 at 10:30
  • @geektrader Just as I suggested earlier. Try using `strptime` with `format='%w'`. That will give you the weekday as a number (Saturday=6) without changing to the second-based POSIXlt. – Oscar de León Feb 27 '13 at 10:51
  • 1
    @OscardeLeón from `?strptime` **strptime converts character vectors to class "POSIXlt": its input x is first converted by as.character.** – CHP Feb 28 '13 at 07:46
  • @geektrader you are quite right. thanks, and sorry for the misleading comment. – Oscar de León Feb 28 '13 at 13:00