
I'm having trouble calculating average temperatures in 10-minute intervals mapped onto a fixed 24-hour vector.

I have a time series stored in a data frame with a proper POSIXct timestamp column. The only problem is that the data are irregularly spaced (10 ± 3 min).

I know how to average over hours, days, months and years, but what I need is an average profile of 24, 72 or 168 hours.

For example, for a 168-hour profile I want to average the readings from every Monday in the observation period at 00:00:00, then at 00:10:00, 00:20:00 and so on, then every Tuesday, every Wednesday, etc.

So my averages must fit into a regular 24/72/168-hour vector.

A 24-hour vector would be defined like this (the example below uses 5-minute steps):

seq(ISOdatetime(2001,2,3,0,0,0), ISOdatetime(2001,2,4,0,0,0), by=(60*5))

This produces a regular vector covering 24 hours:

  [1] "2001-02-03 00:00:00 PST" "2001-02-03 00:05:00 PST"
  [3] "2001-02-03 00:10:00 PST" "2001-02-03 00:15:00 PST"
  [5] "2001-02-03 00:20:00 PST" "2001-02-03 00:25:00 PST"
  [7] "2001-02-03 00:30:00 PST" "2001-02-03 00:35:00 PST"
  [9] "2001-02-03 00:40:00 PST" "2001-02-03 00:45:00 PST"

The problem is that the timestamps of my data shift with each day, as you can see from the samples below. On 2016-09-01 the first reading (supposed to be at 00:00:00) is at 00:01:00; the next day it is at 00:04:00, the next day at 00:07:00, and so on.

I tried xts and zoo with no success, as the smallest aggregation unit I could get was an hour and I need to define it in minutes.
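(Update: as suggested in the comments below, `xts::align.time` can snap the drifting timestamps onto a regular 10-minute grid; a minimal sketch, assuming the series has been converted to an xts object `qxts`:)

```r
library(xts)

# Build an xts series from the data frame described below
qxts <- xts(df$Signal_Raw_Value, order.by = df$Date_Time_Stamp)

# Round every timestamp up to the next 10-minute boundary (n is in seconds)
qxts <- align.time(qxts, n = 600)

# Average by time of day at minute resolution: 144 values for a 24-hour profile
aggregate(qxts, format(index(qxts), "%H:%M"), mean)
```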

The answers I have found deal with averaging a time series over a continuous stretch of data (example 1, example 2).

Unfortunately I couldn't find an answer regarding my problem.

Structure of sample database:

'data.frame':   9490 obs. of  2 variables:
 $ Date_Time_Stamp : POSIXct, format: "2016-09-01 00:01:00" "2016-09-01 00:11:00" "2016-09-01 00:22:00" "2016-09-01 00:32:00" ...
 $ Signal_Raw_Value: num  778 694 592 523 567 ...

My data looks like this (head):

Date_Time_Stamp Signal_Raw_Value
1 2016-09-01 00:01:00           777.51
2 2016-09-01 00:11:00           694.38
3 2016-09-01 00:22:00           591.69
4 2016-09-01 00:32:00           523.23
5 2016-09-01 00:42:00           567.24
6 2016-09-01 00:52:00           547.68

Tail:

Date_Time_Stamp Signal_Raw_Value
9485 2016-11-06 23:02:00           660.15
9486 2016-11-06 23:12:00           635.70
9487 2016-11-06 23:22:00           498.78
9488 2016-11-06 23:32:00           415.65
9489 2016-11-06 23:42:00           425.43
9490 2016-11-06 23:53:00           440.10

First hour of 2016-09-01

Date_Time_Stamp Signal_Raw_Value
1    2016-09-01 00:01:00           777.51
2    2016-09-01 00:11:00           694.38
3    2016-09-01 00:22:00           591.69
4    2016-09-01 00:32:00           523.23
5    2016-09-01 00:42:00           567.24
6    2016-09-01 00:52:00           547.68
7    2016-09-01 01:02:00           562.35

First hour of next day (2016-09-02)

143  2016-09-02 00:04:00           557.46
144  2016-09-02 00:14:00           557.46
145  2016-09-02 00:24:00           562.35
146  2016-09-02 00:35:00           552.57
147  2016-09-02 00:45:00           503.67
148  2016-09-02 00:55:00           484.11
149  2016-09-02 01:05:00           454.77

First hour of the third day in a row (2016-09-03)

285  2016-09-03 00:07:00           655.26
286  2016-09-03 00:17:00           537.90
287  2016-09-03 00:27:00           464.55
288  2016-09-03 00:38:00           454.77
289  2016-09-03 00:48:00           425.43
290  2016-09-03 00:58:00           420.54
291  2016-09-03 01:08:00           400.98

And a fourth day:

426  2016-09-04 00:00:00           865.53
427  2016-09-04 00:10:00           723.72
428  2016-09-04 00:20:00           621.03
429  2016-09-04 00:30:00           562.35
430  2016-09-04 00:40:00           493.89
431  2016-09-04 00:51:00           459.66
432  2016-09-04 01:01:00           435.21

After processing the raw signal values I need to produce something like these: a 24-hour profile and a 168-hour profile.

Thank you!

Patryk
  • One way of solving this problem is to use the seq function to create an array of 10-minute intervals from the start of your data until the end. Use the cut function with your newly created array as the breaks parameter. Now you can aggregate by the newly defined cut values – Dave2e Feb 11 '17 at 03:24
  • I have had partial success with the following (elegant?) solution: `aggregate(qxts, format(index(qxts), "%H"), mean)`. This produces hourly averages for every hour. I'm stuck with changing the interval to 10 or 15 minutes. Any ideas how to move on? – Patryk Mar 17 '17 at 19:56
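The approach from the first comment above can be sketched roughly like this (an illustration only; `df` stands for the sample data frame described in the question):

```r
# Regular 10-minute grid spanning the whole observation period
grid <- seq(min(df$Date_Time_Stamp), max(df$Date_Time_Stamp) + 600, by = "10 min")

# Assign each observation to its 10-minute interval
df$interval <- cut(df$Date_Time_Stamp, breaks = grid)

# Mean signal per calendar interval
aggregate(Signal_Raw_Value ~ interval, data = df, FUN = mean)
```

Note that this averages within each calendar interval; building the repeating 24/72/168-hour profile additionally requires folding those intervals onto a common clock, as the answer below does.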

1 Answer


Since you are still struggling with this problem try this:

# Create sample data:
# a sequence of random times, about 10 minutes apart
rtime <- as.integer(rnorm(1000, 10, 2))
Date_Time_Stamp <- as.POSIXct("2016-09-01")
Date_Time_Stamp <- Date_Time_Stamp + cumsum(rtime) * 60
Signal_Raw_Value <- rnorm(1000, 600, 20)
df <- data.frame(Date_Time_Stamp, Signal_Raw_Value)
# End of sample data creation

# Calculate the number of minutes since midnight
df$minutes <- as.integer(format(df$Date_Time_Stamp, "%H")) * 60 +
              as.integer(format(df$Date_Time_Stamp, "%M"))
# Break each day into 144 10-minute intervals
df$mybreaks <- cut(df$minutes, breaks = seq(0, 1440, 10), include.lowest = TRUE)

# Using dplyr
library(dplyr)
# mean of each group
summarise(group_by(df, mybreaks), mean_signal = mean(Signal_Raw_Value))
# number of observations in each group
summarise(group_by(df, mybreaks), n = n())

Your problem statement was not very clear. This solution breaks the day into 144 ten-minute segments (1440 minutes/day) and averages the data across the entire dataset into those 144 intervals.
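The same idea extends to a 168-hour (weekly) profile: compute minutes since Monday 00:00 instead of minutes since midnight, and cut into 1008 ten-minute intervals (10080 minutes per week). A sketch only, not tested against the real data:

```r
# Day of week as 0 = Monday ... 6 = Sunday ("%u" is 1-7, Monday first)
wday0 <- as.integer(format(df$Date_Time_Stamp, "%u")) - 1

# Minutes elapsed since Monday 00:00
df$week_minutes <- wday0 * 1440 +
  as.integer(format(df$Date_Time_Stamp, "%H")) * 60 +
  as.integer(format(df$Date_Time_Stamp, "%M"))

# 1008 ten-minute intervals per week
df$week_breaks <- cut(df$week_minutes, breaks = seq(0, 10080, 10), include.lowest = TRUE)

summarise(group_by(df, week_breaks), mean_signal = mean(Signal_Raw_Value))
```

A 72-hour profile works the same way with `wday0 %% 3` and `seq(0, 4320, 10)`.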

Dave2e
  • Thank you very much for your help. I have tested your solution, but as a result I am getting a single mean value for the whole data set. – Patryk Mar 20 '17 at 12:44
  • Moreover, in the last script I'm getting an error in n(): this function should not be called directly. I think the problem is with the structure of 'mybreaks': it is a factor and should be a date format. I tried to fix it but I'm getting errors. Will report if I solve it. – Patryk Mar 20 '17 at 13:20
  • Part of the solution is that I put summarise instead of summarise_. Now I am getting 10-minute interval values, but the calculated mean is constant and it failed to group by interval: I am getting the whole list of 884 rows, while it should be 144 rows for a 10-minute interval. – Patryk Mar 20 '17 at 13:28
  • @Dave2eAs a temporary solution I have applied align.time and created equal time indices. Then I could aggregate it with xts type: 'code' aggregate (qxts, format (index (qxts), "%H"), mean) 'code' – Patryk Mar 20 '17 at 15:08
  • Working without an actual sample of the data and an example of the desired output leads to these types of problems. I'm sorry, but your question was not very clear. Do you want the mean of each 10-min interval across all 4 days? – Dave2e Mar 20 '17 at 15:09
  • I would like to decide on the horizon of aggregation. Default is 24 hours = diurnal pattern. Example here: [link](https://www.researchgate.net/publication/266503942_Smart_meter_enabled_water_end-use_demand_data_Platform_for_the_enhanced_infrastructure_planning_of_contemporary_urban_water_supply_networks/figures?lo=1) – Patryk Mar 20 '17 at 15:11
  • Correct script to aggregate using minute index: `aggregate (qxts, format (index (qxts), "%H:%M"), mean) ` It is producing mean in 144 intervals (24h/10 mins). The main problem with using align.time is that timestamp is shifted, but I don't need it to be super precise at this stage. Still looking for solution to average by equal interval for any horizon (from 24 hrs up to 168hrs). – Patryk Mar 20 '17 at 15:14
  • "Do you want the mean of each 10 min interval across all 4 days?" I am sorry I was not precise: I want to have 10 min interval across 24 hours (average day) and choice to extend averaging period to 48, 72 and 168 hrs (average week). Please see ref [4] and [5] of actual results I am expecting. – Patryk Mar 20 '17 at 15:24
  • Thank you @Dave2e, but this is not solving my problem: `summarise( group_by(df, mybreaks), mean(Signal_Raw_Value))` Result: `1 600.3978` – Patryk Mar 24 '17 at 19:55
  • However after fixing it to `summarise_( group_by(df, mybreaks), mean(Signal_Raw_Value)) mean(Signal_Raw_Value)` I am getting result grouped into intervals:`# A tibble: 144 × 2 mybreaks `600.397842660427` 1 [0,10] 600.3978 2 (10,20] 600.3978 3 (20,30] 600.3978 4 (30,40] 600.3978 5 (40,50] 600.3978 6 (50,60] 600.3978 7 (60,70] 600.3978 8 (70,80] 600.3978 9 (80,90] 600.3978 10 (90,100] 600.3978` – Patryk Mar 24 '17 at 19:58
  • `summarise_` is using the actual `dplyr` library, while `summarise` is from the `plyr` library. I don't understand why I am getting this error with the last line of your script: `summarise_( group_by(df, mybreaks), n()) Error in n() : This function should not be called directly` – Patryk Mar 24 '17 at 20:03
  • Hopefully you got the answer you were looking for. The last line is just for informational purposes; I am not sure why you are having a conflict unless you have both plyr and dplyr loaded at the same time. FYI, the functions length() and n() should generate the same answer. – Dave2e Mar 24 '17 at 20:56