na.approx function does not produce correct timestamps

Question

I have a large dataset of electric load data with a missing timestamp for the last Sunday of March of each year due to daylight saving time. I have copied below a few rows containing a missing timestamp.

    structure(list(Date_Time = structure(c(1427569200, 1427572800, 
1427576400, 1427580000, 1427583600, 1427587200, NA, 1427590800, 
1427594400, 1427598000, 1427601600, 1427605200), tzone = "EET", class = c("POSIXct", 
"POSIXt")), Day_ahead_Load = c("7139", "6598", "6137", "5177", 
"4728", "4628", "N/A", "4426", "4326", "4374", "4546", "4885"
), Actual_Load = c(6541, 6020, 5602, 5084, 4640, 4593, NA, 4353, 
NA, NA, 4333, 4556)), row.names = c(NA, -12L), class = "data.frame")

#>              Date_Time Day_ahead_Load Actual_Load
#> 1  2015-03-28 21:00:00           7139        6541
#> 2  2015-03-28 22:00:00           6598        6020
#> 3  2015-03-28 23:00:00           6137        5602
#> 4  2015-03-29 00:00:00           5177        5084
#> 5  2015-03-29 01:00:00           4728        4640
#> 6  2015-03-29 02:00:00           4628        4593
#> 7                 <NA>            N/A          NA
#> 8  2015-03-29 04:00:00           4426        4353
#> 9  2015-03-29 05:00:00           4326          NA
#> 10 2015-03-29 06:00:00           4374          NA
#> 11 2015-03-29 07:00:00           4546        4333
#> 12 2015-03-29 08:00:00           4885        4556

I have tried to fill these missing timestamps using na.approx, but the function returns "2015-03-29 02:30:00", instead of "2015-03-29 03:00:00". It does not use the correct scale.

mydata$Date_Time <- as.POSIXct(na.approx(mydata$Date_Time), origin = "1970-01-01 00:00:00", tz = "EET")

#>              Date_Time Day_ahead_Load Actual_Load
#> 1  2015-03-28 21:00:00           7139        6541
#> 2  2015-03-28 22:00:00           6598        6020
#> 3  2015-03-28 23:00:00           6137        5602
#> 4  2015-03-29 00:00:00           5177        5084
#> 5  2015-03-29 01:00:00           4728        4640
#> 6  2015-03-29 02:00:00           4628        4593
#> 7  2015-03-29 02:30:00            N/A          NA
#> 8  2015-03-29 04:00:00           4426        4353
#> 9  2015-03-29 05:00:00           4326          NA
#> 10 2015-03-29 06:00:00           4374          NA
#> 11 2015-03-29 07:00:00           4546        4333
#> 12 2015-03-29 08:00:00           4885        4556

I have also tried using some other functions, such as "fill", but none of them works properly. As I am fairly new to R, I would really appreciate any suggestions for filling the missing timestamps. Thank you in advance.

G. Grothendieck · Accepted Answer · 2020-11-10T13:17:05.073

Actually the answer is correct. There is only one hour difference between the 6th and 8th rows due to the change from standard time to daylight savings time.

Use GMT (or equivalently UTC) if you intended that there be 2 hours between those rows. Below we use the same date and time as a character string but change the timezone to GMT to avoid daylight savings time changes.

diff(mydata[c(6, 8), 1]) 
## Time difference of 1 hours

# use GMT
tt <- as.POSIXct(format(mydata[[1]]), tz = "GMT")
as.POSIXct(na.approx(tt), tz = "GMT", origin = "1970-01-01")
##  [1] "2015-03-28 21:00:00 GMT" "2015-03-28 22:00:00 GMT"
##  [3] "2015-03-28 23:00:00 GMT" "2015-03-29 00:00:00 GMT"
##  [5] "2015-03-29 01:00:00 GMT" "2015-03-29 02:00:00 GMT"
##  [7] "2015-03-29 03:00:00 GMT" "2015-03-29 04:00:00 GMT"
##  [9] "2015-03-29 05:00:00 GMT" "2015-03-29 06:00:00 GMT"
## [11] "2015-03-29 07:00:00 GMT" "2015-03-29 08:00:00 GMT"

You are right, I should not be "filling" the missing timestamp because there is no actual load at that hour. I have a weather data set which is in UTC that I want to index my load data to, therefore I will change the timezone of my load data to UTC. Thank you for enlightening me! — Iro, Nov 10 '20 at 14:29

score 0 · Answer 2 · answered Nov 10 '20 at 12:42

You could use the following loop which would ensure that you always get the correct answer, even if you have many NA's following each other in the data.

library(lubridate)
dat$Date_Time <- as_datetime(as.character(dat$Date_Time))
dat$id <- 1:nrow(dat)
dat$previoustime <- NA
dat$timediff <- NA

for( i in 2:nrow(dat)) {
  previousdateinds <- which(!is.na(dat$Date_Time) & dat$id < i)
  previousdateind <- tail(previousdateinds,1)
  dat$timediff[i] <- i-previousdateind # number of rows between this row and the last non-NA time
  dat$previoustime[i] <- as.character(dat$Date_Time)[previousdateind]
  print(previousdateind)
  }
dat$previoustime <- as_datetime(dat$previoustime)

dat$result <- ifelse(is.na(dat$Date_Time), as.character(dat$previoustime+dat$timediff*60*60), 
                     as.character(dat$Date_Time))
dat[6:8,]
            Date_Time Day_ahead_Load Actual_Load id        previoustime timediff              result
6 2015-03-29 02:00:00           4628        4593  6 2015-03-29 01:00:00        1 2015-03-29 02:00:00
7                <NA>            N/A          NA  7 2015-03-29 02:00:00        1 2015-03-29 03:00:00
8 2015-03-29 04:00:00           4426        4353  8 2015-03-29 02:00:00        2 2015-03-29 04:00:00

I am not sure I understand this solution. Also I should note that I have 47.477 rows of data.. — Iro, Nov 10 '20 at 13:38
you are looping through each of the rows to find out when was the last time a timedate was not `NA`, and you add x hours to this last time depending on the number of rows between last `not NA` and this row. — gaut, Nov 10 '20 at 13:43

na.approx function does not produce correct timestamps

2 Answers2