
I'm using the xts package to create a sample time series in R. I've created a range of dates separated by the minute, generated sample data for each of those dates, and finally aggregated on the hour, summing the data. This works, except for one issue.

Once I've done the aggregation, the index does not stamp each observation on the hour, but on the 59th minute. I need the index to fall on the hour for merging purposes. Below is my code:

# xts simple example code
library(xts)
library(chron)

BD <- chron("01/01/2015", "00:00:00") # Setting begin date.
ED <- chron("02/01/2015", "23:59:00") # Setting end date.
DS <- seq(BD, ED, by = times("00:01:00")) # Creating a sequence of dates separated by a minute.

data <- runif(length(DS), 0, 100) # Generating random numerical data the length of the date sequence.

x <- xts(data, DS) # Creates an xts object indexed by the dates of "DS" with data from "data".
colnames(x) <- "Data" # Just renaming the data column in the xts object.

x.agg <- period.apply(x, endpoints(x, "hours"), sum) # Aggregating by hour.

I've tried this method:

index(x.agg) <- index(x.agg) - (1/24/60) * 59

but it gives me this response in the tail:

> index(x.agg) <- index(x.agg) - (1/24/60) * 59
> tail(index(x.agg))
[1] (02/01/15 13:00:00) (02/01/15 14:00:00) (02/01/15 15:00:00) (02/01/15 16:00:00) (02/01/15 17:00:00)
[6] (02/01/15 18:00:00)

The whole idea above is simply to subtract 59 minutes to land on the hour, but it doesn't seem to work. I've also tried truncating and rounding, but they also give me strange results. Any ideas would be greatly appreciated!

giraffehere

1 Answer


What you're asking to do is potentially very dangerous. You said you want to align the aggregated data with the beginning of the hour so you can merge it with another series. You could be introducing bias by changing the observation timestamp.

The index actually does represent hourly data; notice that you used a function called endpoints, so each aggregated row is stamped at the end of its hour. If you want rows stamped at the beginning of each hour, use a startpoints function instead. The startpoints function from the question I linked to has a subtle bug in this case. Here's a patched version:

startpoints <- function (x, on = "months", k = 1) {
  c(0, head(endpoints(x, on, k)[-1] + 1, -1))
}

Notice how the Data column is different if you use endpoints versus startpoints. That's an example of the bias you could potentially introduce if you simply changed the index of the output from the endpoints version.

> head(x.agg <- period.apply(x, startpoints(x, "hours"), sum))
                        Data
(01/01/15 07:00:00) 3249.408
(01/01/15 08:00:00) 2967.049
(01/01/15 09:00:00) 3110.734
(01/01/15 10:00:00) 2977.106
(01/01/15 11:00:00) 3046.607
(01/01/15 12:00:00) 3417.428
> head(x.agg <- period.apply(x, endpoints(x, "hours"), sum))
                        Data
(01/01/15 06:59:00) 3236.172
(01/01/15 07:59:00) 2893.148
(01/01/15 08:59:00) 3100.842
(01/01/15 09:59:00) 2996.260
(01/01/15 10:59:00) 3088.869
(01/01/15 11:59:00) 3353.841

Another possible solution is to use align.time on the output from the period.apply call that uses endpoints. align.time(x, 3600) will round the index up to the next hour.

> head(x.agg <- align.time(period.apply(x, endpoints(x, "hours"), sum), 3600))
                        Data
(01/01/15 07:00:00) 3236.172
(01/01/15 08:00:00) 2893.148
(01/01/15 09:00:00) 3100.842
(01/01/15 10:00:00) 2996.260
(01/01/15 11:00:00) 3088.869
(01/01/15 12:00:00) 3353.841
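Putting the align.time route together as a self-contained script (a sketch assuming xts and chron are installed; the set.seed call is added here for reproducibility and is not part of the original example):

```r
library(xts)
library(chron)

set.seed(42)  # added for reproducibility; not in the original example
BD <- chron("01/01/2015", "00:00:00")
ED <- chron("02/01/2015", "23:59:00")
DS <- seq(BD, ED, by = times("00:01:00"))
x <- xts(runif(length(DS), 0, 100), DS)
colnames(x) <- "Data"

# Aggregate on hour endpoints, then round each timestamp up to the next hour
x.agg <- align.time(period.apply(x, endpoints(x, "hours"), sum), 3600)
```

After this, every index value falls exactly on an hour boundary, while each row still sums the observations that occurred before that timestamp.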
Joshua Ulrich
  • This is extremely close to what I want ultimately. However, I ran it with the substitution of the startpoints function provided and the only issue is that the first aggregated date is represented as: "01/01/15 01:00:00" whereas I would want it represented as: "01/01/15 00:00:00" meaning that this is the aggregation of all those data points whose date was within that hour. I tried substituting in a minus 59 instead of + 1, but it created the same issue at the end of the data. Possibly I am misunderstanding how endpoints parses "hours". – giraffehere Aug 05 '15 at 15:04
  • @giraffehere: use the `startpoints` function from the question I linked to (not my 'patched' version), then change the call to: `period.apply(x, c(0,startpoints(x, "hours")), sum)`. – Joshua Ulrich Aug 05 '15 at 15:11
  • For some odd reason the aggregation for the time "01/01/15 00:00:00" is still only counting the minute of "01/01/15 00:00:00" into the sum (so it's a very small amount); the rest appears to be going into the "01/01/15 01:00:00" hour (much larger amount), so there must be something I'm misunderstanding. But anyways, you've been extremely helpful with this and I don't want to take up any more of your time. I believe I'll be able to figure it out or find an alternative in due time. Thank you so much! :) – giraffehere Aug 05 '15 at 19:08
  • @giraffehere: I think what you're misunderstanding is that what you're asking for is potentially very dangerous, so I haven't told you how to do it. You're asking to "aggregate" time series data by creating a data structure that contains data *from the future* at every timestamp. See my edit for another potential solution. – Joshua Ulrich Aug 05 '15 at 20:41
  • While I do see the point you're making, I'm having trouble understanding why wanting to aggregate between (and including) the times (for example) 00:00:00 and 00:59:00 and having that aggregation represented as the hour of 00:00:00 is so bad. And as you said, I do want the time stamp to represent future data as in: "This is the aggregation of the data points that occurred within the hour 00:00:00". `period.apply(x, endpoints(x, "hours"), sum)` does give me the aggregation I want, but not with the right time stamps. – giraffehere Aug 06 '15 at 13:57
  • @giraffehere: The hour 00:00:00 is the *beginning* of the hour, so you're giving that timestamp to data that occurs *after* that timestamp. And you said you wanted to do that so you could merge with hourly data. That means you will be merging hourly data that occurred at 00:00:00 with aggregated data that occurred *after* 00:00:00. That's bad. – Joshua Ulrich Aug 06 '15 at 14:07
  • I definitely see what you're saying, but I think I wasn't clear about the data I'm trying to merge with. The data I'm trying to merge with is also aggregated data, meaning that it is the sum of all data points collected within an hour. It's represented by the hour it occurred within. For example, take the hour 07:00:00. After aggregation that hour will represent the sum of all data points within the 07:00:00 hour, which include, for example, 07:01:00 or 07:45:00, all the way to the last minute (inclusive) 07:59:00. I just want the time series to reflect this indexing "choice" if you will. – giraffehere Aug 06 '15 at 16:22
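For reference, the labeling the last comment describes (each hour's sum stamped with the hour it occurred within) can be sketched by aggregating on endpoints and then flooring each timestamp, bearing in mind the look-ahead caveat raised in the answer. This is a minimal toy sketch using a POSIXct index for simplicity, not code from the thread:

```r
library(xts)

# Toy setup: one observation per minute for three full hours
idx <- as.POSIXct("2015-01-01 00:00:00", tz = "UTC") + 60 * (0:179)
x <- xts(rep(1, length(idx)), idx)
colnames(x) <- "Data"

# Aggregate on hour endpoints; rows come out stamped at :59 ...
x.agg <- period.apply(x, endpoints(x, "hours"), sum)

# ... then floor each timestamp to the start of its hour
index(x.agg) <- as.POSIXct(trunc(index(x.agg), units = "hours"))
```

Each row now sums the 60 observations within the hour it is labeled with, which matches the indexing "choice" described in the comment, at the cost of timestamps that refer to data observed after them.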