Aggregate 5 minute data to hourly sums with NA's

Question

I have a time series of 5-minute precipitation data that looks like this:

            Datum mm
1 2004-04-08 00:05:00 NA
2 2004-04-08 00:10:00 NA
3 2004-04-08 00:15:00 NA
4 2004-04-08 00:20:00 NA
5 2004-04-08 00:25:00 NA
6 2004-04-08 00:30:00 NA

With this structure:

'data.frame':   1098144 obs. of  2 variables:
$ Datum: POSIXlt, format: "2004-04-08 00:05:00" "2004-04-08 00:10:00"   "2004-04-08 00:15:00" "2004-04-08 00:20:00" ...
$ mm   : num  NA NA NA NA NA NA NA NA NA NA ...

As you can see, the time series begins with a long run of NAs. Further down there is measured precipitation, interrupted by occasional single NAs caused by malfunctions of the measuring station.

What I'm trying to achieve is to sum the measured precipitation into hourly totals, ignoring the NAs.

This is what I tried so far:

sums <- aggregate(precip["mm"], 
               list(cut(precip$Datum, "1 hour")), sum)

Even though the timestamps are correctly aggregated to hours, all sums are 0 or NA. The sums are not calculated even for hours that contain no NA at all.

One more thing to take into account:

Hourly precipitation sums in meteorology always describe the cumulative sum up to a given hour: the amount of precipitation reported at 0:00 is the sum from 23:00 of the previous day until 0:00. So each timestamp must carry the sum of the preceding hour.

Reproducible Example

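set.seed(1120)
# Six half-hourly timestamps starting at 23:00 on 2004-03-08
s <- as.POSIXlt("2004-03-08 23:00:00")
r <- seq(s, s + 1e4, "30 min")
# Random values from 1 to 5, with NA as a possible draw
precip <- data.frame(Datum = r, mm = sample(c(1:5, NA), 6, TRUE))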

            Datum mm
2004-03-08 23:00:00  4
2004-03-08 23:30:00  1
2004-03-09 00:00:00  2
2004-03-09 00:30:00  4
2004-03-09 01:00:00  1
2004-03-09 01:30:00  4

With the above example, the result I am looking for is:

            Datum mm
2004-03-09 00:00:00 5
2004-03-09 01:00:00 6
2004-03-09 02:00:00 5

For the future, please consider reading up on [ask] and how to create a [reproducible example in R](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). It makes it easier for others to help you without the need for a lot of clarification-comments if you provide clear expected output. — Heroka, Nov 20 '15 at 15:34
In the example data frame, what should the output be? Please answer with *actual code* not a written description of the output. — Pierre L, Nov 20 '15 at 15:52
Did we really go back and forth this long and you had the solution the whole time? My original solution produces that output. — Pierre L, Nov 20 '15 at 16:30
No, sry it doesn't. In my desired result the sum of `2004-03-09 00:00:00` (`= 5`) is the addition of `4+1` of 23:00 and 23:30. It is shifted. — M. Weeker, Nov 20 '15 at 16:38
Possible duplicate of [Adding the values of second column based on date and time of first column](http://stackoverflow.com/questions/32097282/adding-the-values-of-second-column-based-on-date-and-time-of-first-column) — germcd, Nov 20 '15 at 16:43

Pierre L · Accepted Answer · score 3 · 2015-11-20T16:30:56.977

Try adding na.rm=TRUE:

aggregate(precip['mm'], list(cut(precip$Datum, "1 hour")), sum, na.rm=TRUE)
#               Group.1 mm
# 1 2004-04-08 00:00:00 26
# 2 2004-04-08 01:00:00 35
# 3 2004-04-08 02:00:00 25
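
To see why na.rm matters here, the following minimal sketch shows how sum treats missing values. The vector hour_with_gap and its numbers are made up for illustration and are not taken from the question's data.

```r
# Hypothetical readings for one hour, with one missing value.
hour_with_gap <- c(0.2, NA, 0.5)

# By default a single NA makes the whole sum NA.
default_sum <- sum(hour_with_gap)

# With na.rm = TRUE the missing value is dropped before summing.
clean_sum <- sum(hour_with_gap, na.rm = TRUE)

# An hour that contains only NAs sums to 0 once the NAs are removed.
empty_sum <- sum(c(NA_real_, NA_real_), na.rm = TRUE)
```

Note the last case: with na.rm = TRUE, an hour without any valid measurement is reported as 0 rather than NA, so it cannot be told apart from a dry hour in the aggregated result.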

Reproducible Example

set.seed(1120)
s <- as.POSIXlt("2004-04-08 00:05:00")
r <- seq(s, s+1e4, "5 min")
precip <- data.frame(Datum=r, mm=sample(c(1:5,NA), 34, T))

addendum

To your second question: if you would like measurements that fall exactly on the hour to be counted with the preceding hour, add right=TRUE:

aggregate(precip['mm'], list(cut(precip$Datum, "1 hour", right=TRUE)), sum, na.rm=TRUE)

Further Explanation

Here is a more detailed example that shows how the solution works:

p <- c("2004-04-07 23:48:20", "2004-04-08 00:00:00", "2004-04-08 00:03:20")
ptime <- as.POSIXlt(p)
#[1] "2004-04-07 23:48:20 EDT" "2004-04-08 00:00:00 EDT" "2004-04-08 00:03:20 EDT"

We have three timestamps to separate into hourly groups. If we use cut without any extra arguments, the second entry, "2004-04-08 00:00:00 EDT", is grouped with the third entry in the "00:00" hour:

cut(ptime, "1 hour")
#[1] 2004-04-07 23:00:00 2004-04-08 00:00:00 2004-04-08 00:00:00

But if we add the argument right=TRUE, it is grouped with the "23:00" hour instead:

cut(ptime, "1 hour", right=TRUE)
#[1] 2004-04-07 23:00:00 2004-04-07 23:00:00 2004-04-08 00:00:00

In this way the right argument controls which hour a timestamp lying exactly on the boundary is assigned to.

edit

With your new data, the original solution produces the sums you are looking for:

aggregate(precip['mm'], list(cut(precip$Datum, "1 hour")), sum, na.rm=TRUE)
              Group.1 mm
1 2004-03-08 23:00:00  5
2 2004-03-09 00:00:00  6
3 2004-03-09 01:00:00  5
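
The groups above are labelled with the start of each hour, while the expected output in the question labels each sum with the end of the hour. A minimal sketch of one way to get that labelling follows. It is not part of the original answer: it swaps cut for trunc so that the labels stay date-times and can be shifted by 3600 seconds. The sample values are typed in from the question's table, and the UTC time zone and the helper name hour_end are assumptions made for illustration.

```r
# Values copied from the question's sample table (time zone UTC is assumed).
precip <- data.frame(
  Datum = as.POSIXct("2004-03-08 23:00:00", tz = "UTC") + 1800 * 0:5,
  mm    = c(4, 1, 2, 4, 1, 4)
)

# Start of the hour each reading falls in, shifted forward by one hour
# so that every group is labelled with the END of its hour.
# hour_end is a hypothetical helper vector, not part of the original answer.
hour_end <- as.POSIXct(trunc(precip$Datum, units = "hours")) + 3600

# Sum per end-of-hour label, ignoring missing values.
sums <- aggregate(precip["mm"], list(Datum = hour_end), sum, na.rm = TRUE)
sums
```

As in the question's expected output, a reading taken exactly on the hour (for example 23:00) is counted toward the following label (00:00). If on-the-hour readings should instead close the preceding hour, the grouping would need the right=TRUE behaviour described earlier.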

edited Nov 20 '15 at 16:30

answered Nov 20 '15 at 13:35

Pierre L


Your solution successfully removes all the `NA`s, but still no sum is being calculated; every hour has a sum of `0`. – M. Weeker Nov 20 '15 at 13:42
Have you tried the reproducible example? If your data differs, please indicate where. – Pierre L Nov 20 '15 at 13:43
Ok, thank you very much! Thanks to your example I realized that the precipitation column was not loaded correctly as a numeric variable, as the decimal separator was a `,`... Germans...! – M. Weeker Nov 20 '15 at 14:01
Sorry to bother you again, but now - to do it correctly - the sum of e.g. 18:00 is the sum from 18:00 - 19:00, which actually should be the sum from 17:00 - 18:00. – M. Weeker Nov 20 '15 at 14:11
Second question addressed in answer – Pierre L Nov 20 '15 at 14:34
`right = TRUE` unfortunately does not give the correct result. The sum of 0:00 sums up from 23:00 to 0:00 and so on, cumulating the sum _until_ 0:00. – M. Weeker Nov 20 '15 at 14:52
Your expected output is unclear. Just as we did with the first question, I would like to stress the importance of creating examples. If you create a small example with your desired output, it is so much easier to help. I think you know at this point that I am honestly trying to assist you. Please help me by adding a specific example to your original post. – Pierre L Nov 20 '15 at 14:55
I edited my post though I don't know how to be more specific. Thank you very much for your great effort. – M. Weeker Nov 20 '15 at 15:46
You're welcome. I think you are confused by what I mean by "an example". I will add a sample data frame to your question. Please respond with the correct answer that should result. – Pierre L Nov 20 '15 at 15:50
@candles_and_oranges Hi, I also adapted this function to aggregate 5-minute data to daily sums, but I found that R applies the local time zone directly, which means the daily sums are placed at 1:00 am every day during summertime. I want all my daily sums to be at 00:00. How should I change the code to achieve this? – LEE Sep 09 '20 at 16:52
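
The decimal-comma problem reported in the comments above can be avoided at load time. The following is only a sketch: the thread does not say how the file was read, so the semicolon-separated layout and the inline sample text are assumptions made for illustration.

```r
# Hypothetical file content: semicolon-separated, comma as decimal mark.
raw <- "Datum;mm
2004-04-08 00:05:00;0,2
2004-04-08 00:10:00;NA
2004-04-08 00:15:00;1,5"

# read.csv2 defaults to sep = ";" and dec = ",", so mm is parsed as numeric.
precip <- read.csv2(text = raw, stringsAsFactors = FALSE)
precip$Datum <- as.POSIXct(precip$Datum, tz = "UTC")  # UTC is an assumption

# If the column was already loaded as text, convert it afterwards instead.
mm_chr <- c("0,2", NA, "1,5")
mm_num <- as.numeric(sub(",", ".", mm_chr, fixed = TRUE))
```

Checking is.numeric(precip$mm) or running str(precip) right after loading is a quick way to confirm that the column has the expected type before aggregating.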

score 1 · Answer 2 · answered Nov 20 '15 at 14:00


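You can use dplyr to calculate the sums like this:

library(dplyr)
# Build an hour key (date plus hour) for each timestamp
precip$hour <- strftime(precip$Datum, "%Y-%m-%d %H")
# Sum the precipitation per hour, ignoring missing values
sum_hour <- precip %>% group_by(hour) %>% summarise(sum_hour = sum(mm, na.rm = TRUE))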

vdep

Thanks for your answer, but it does not take into account my last comment on @PierreLafortune's answer either. Looking forward to your solution! – M. Weeker Nov 20 '15 at 14:13
do you mean if the hour is say 4:00, should it calculate for (3:00 to 4:00) ?, if so what will you do for 0:00 ? – vdep Nov 20 '15 at 14:32
yes, the sum of 4:00 should be 3:00-4:00 and the sum of 0:00 should then be the sum of 23:00-0:00... now this is tricky. – M. Weeker Nov 20 '15 at 14:36
23:00 from previous day ? – vdep Nov 20 '15 at 14:37
