4

I'm working with time series data at 5-minute intervals. Some of the 5-minute observations are missing. I'd like to resample the dataset so that the missing 5-minute periods are filled in with NaN values. I found great information on how to approach this here: R: Insert rows for missing dates/times.

I've created a data.frame "df" with a POSIXct timeseries column "time".

The pad function in the padr package allows a user to set an interval by the minute, hour, day, etc.

> interval
> The interval of the returned datetime variable. When NULL the interval will be equal to the interval of the datetime variable. When specified it can only be lower than the interval of the input data. See Details.

padr's pad function will create 1-minute intervals on my 5-minute data. How do I set my own user-defined interval (e.g. 5-minutes)?

www
Guy
  • You can pad to the minute and aggregate to five minutes yourself. – Pierre L Mar 03 '17 at 19:22
  • For the moment, non-standard intervals are not yet supported by padr. I am working on an implementation (mostly mentally still) that would enable the user to use any interval. Expect this to be on CRAN in two or three months. Until then, either Pierre's answer or lubridate::round_date are fine alternatives. – Edwin Mar 06 '17 at 06:43
  • Edwin, I look forward to an update in the coming months! It will be great to see R gain more capabilities like those of the pandas package in Python. – Guy Mar 06 '17 at 14:54
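
As a further stopgap that needs no extra package, the gap filling described in the question can be sketched in base R by building the complete 5-minute grid with seq() and left-joining the data onto it with merge(). This is only a minimal sketch: the data frame and its `value` column are made up for illustration, and the timestamps are assumed to fall exactly on the 5-minute grid.

```r
# Hypothetical example data: 5-minute readings with 00:10 and 00:20 missing
df <- data.frame(
  time  = as.POSIXct("2017-03-03 00:00:00", tz = "UTC") + 60 * c(0, 5, 15, 25),
  value = c(1.2, 1.5, 1.1, 1.7)
)

# Complete 5-minute grid from the first to the last observed timestamp
grid <- data.frame(time = seq(min(df$time), max(df$time), by = "5 min"))

# Left join onto the grid; timestamps absent from df get NA in value
filled <- merge(grid, df, by = "time", all.x = TRUE)
filled
```

The join only matches rows whose timestamps are identical, so readings that are a few seconds off the grid would need to be rounded first.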

3 Answers

5

A new version of padr hit CRAN yesterday. You can now use a unit count other than 1 in each of the intervals (for example "5 min"):

library(padr)
library(dplyr)
coffee %>% thicken("5 min") %>% select(-time_stamp) %>% pad()
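
For data shaped like the asker's (a data frame with a single POSIXct column), the interval can presumably be passed to pad() directly. The sketch below assumes a padr version that accepts multi-unit intervals (0.3.0 or later, as far as I can tell); the example data frame and its `value` column are invented for illustration.

```r
library(padr)  # assumes a version with multi-unit interval support

# Hypothetical example data: 5-minute readings with 00:10 and 00:20 missing
df <- data.frame(
  time  = as.POSIXct("2017-03-03 00:00:00", tz = "UTC") + 60 * c(0, 5, 15, 25),
  value = c(1.2, 1.5, 1.1, 1.7)
)

# Request the interval explicitly instead of letting pad() infer it
padded <- pad(df, interval = "5 min")
padded
```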
Edwin
2

Try using pad() to pad to the minute, then aggregate to the interval you'd like afterwards. This leaves you free to choose your own summary function:

library(padr)
library(dplyr)  # provides %>%, mutate(), group_by(), summarise()
account <- data.frame(day     = as.Date(c('2016-10-21', '2016-10-23', '2016-10-26')),
                      balance = c(304.46, 414.76, 378.98))

account %>%
  pad('min') %>%                               # pad to the minute
  mutate(five_min = cut(day, "5 min")) %>%     # bin each timestamp into a 5-minute period
  group_by(five_min) %>%                       # group by the new column
  summarise(ttl = sum(balance, na.rm = TRUE))  # sum the balance within each period
# # A tibble: 1,441 × 2
#               five_min    ttl
#                 <fctr>  <dbl>
# 1  2016-10-21 00:00:00 304.46
# 2  2016-10-21 00:05:00   0.00
# 3  2016-10-21 00:10:00   0.00
# 4  2016-10-21 00:15:00   0.00
# 5  2016-10-21 00:20:00   0.00
# 6  2016-10-21 00:25:00   0.00
# 7  2016-10-21 00:30:00   0.00
# 8  2016-10-21 00:35:00   0.00
# 9  2016-10-21 00:40:00   0.00
# 10 2016-10-21 00:45:00   0.00
# # ... with 1,431 more rows
Pierre L
  • I like it, but might use `lubridate::round_date` (which despite the name works with datetimes, too) so as to end up with POSIXct instead of factor. Or just convert back. – alistaire Mar 03 '17 at 20:09
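
The "convert back" option from the comment above can be sketched in base R as follows. The timestamps are hypothetical, and the second variant (flooring the underlying seconds) is an alternative I am adding for comparison, not something from the answer; it avoids the round trip through strings.

```r
# Hypothetical 1-minute timestamps spanning 12 minutes
x <- as.POSIXct("2016-10-21 09:00:00", tz = "UTC") + 60 * 0:11

# cut() returns a factor whose labels are the bin start times
five_min_factor <- cut(x, "5 min")

# Convert the labels back to POSIXct so the column stays a datetime
five_min <- as.POSIXct(as.character(five_min_factor), tz = "UTC")

# Alternative: floor the underlying seconds to multiples of 300
five_min_alt <- as.POSIXct(floor(as.numeric(x) / 300) * 300,
                           origin = "1970-01-01", tz = "UTC")
```

Either result can be used as the grouping column in place of the factor.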
2

While I couldn't get Pierre's solution to run with my data format (which, admittedly, I did not specify in the question), I was able to build a solution on Pierre's strategy by selecting a 5-minute subset of the padded 1-minute data. I'm excited about the new padr package and hope more functionality is added down the road.

My strategy was the following:

library(padr)
library(zoo)
dfpad <- pad(df, interval = "min")              # resample df to 1-minute intervals
dfpadzoo <- zoo(dfpad, order.by = dfpad$time)   # convert the padded df to a zoo time series
sensStart <- start(dfpadzoo)                    # first timestamp in the data
sensEnd <- end(dfpadzoo)                        # last timestamp in the data
nexttime <- df$time[2]                          # timestamp of the second data row
# Sampling interval in minutes (assumes the first two rows are one interval apart):
tint_min <- as.double(difftime(nexttime, sensStart, tz = "UTC", units = "mins"))
# Regularly spaced timestamps from the start to the end of the data:
timeFill <- seq(from = as.POSIXct(sensStart, tz = "UTC"),
                to = as.POSIXct(sensEnd, tz = "UTC"), by = 60 * tint_min)
# Subset of the padded series at 5-minute spacing
sensdatazoo <- dfpadzoo[timeFill]

By converting the df to a zoo object, I was able to use the additional time series functionality found in the zoo package.

Joshua Ulrich
Guy
  • This seems like a lot of code for this task. Would something simple like this work: `sensdatazoo <- merge(dfpadzoo, zoo(,seq(start(dfpadzoo), end(dfpadzoo), by = "5 min")))` – Joshua Ulrich Mar 05 '17 at 13:04
  • Thank you for the great suggestion! I'm new to the R environment and just getting my feet wet in the syntax and available libraries. – Guy Mar 06 '17 at 14:53
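
The merge idiom from the comment above can be tried on a small series. This is a minimal sketch with made-up data, applied to the original (unpadded) series and not to dfpadzoo: merge() on zoo objects keeps the union of both indexes by default, so on a series already padded to 1 minute it would retain the 1-minute rows.

```r
library(zoo)

# Hypothetical irregular series: the 00:10 reading is missing
times <- as.POSIXct("2017-03-03 00:00:00", tz = "UTC") + 60 * c(0, 5, 15, 20)
z <- zoo(c(1.2, 1.5, 1.1, 1.7), order.by = times)

# Zero-width zoo object that carries only the regular 5-minute index
grid <- zoo(, seq(start(z), end(z), by = "5 min"))

# Outer join on the index; grid times missing from z become NA
filled <- merge(z, grid)
filled
```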