Insert rows for missing dates/times

Question

I am new to R but have turned to it to solve a problem with a large data set I am trying to process. Currently I have a 4 columns of data (Y values) set against minute-interval timestamps (month/day/year hour:min) (X values) as below:

    timestamp          tr            tt         sr         st  
1   9/1/01 0:00   1.018269e+02   -312.8622   -1959.393   4959.828  
2   9/1/01 0:01   1.023567e+02   -313.0002   -1957.755   4958.935  
3   9/1/01 0:02   1.018857e+02   -313.9406   -1956.799   4959.938  
4   9/1/01 0:03   1.025463e+02   -310.9261   -1957.347   4961.095  
5   9/1/01 0:04   1.010228e+02   -311.5469   -1957.786   4959.078

The problem I have is that some timestamp values are missing - e.g. there may be a gap between 9/1/01 0:13 and 9/1/01 0:27 and such gaps are irregular through the data set. I need to put several of these series into the same database and because the missing values are different for each series, the dates do not currently align on each row.

I would like to generate rows for these missing timestamps and fill the Y columns with blank values (no data, not zero), so that I have a continuous time series.

I'm honestly not quite sure where to start (not really used R before so learning as I go along!) but any help would be much appreciated. I have thus far installed chron and zoo, since it seems they might be useful.

Thanks!

Take a look at this question: http://stackoverflow.com/questions/16742725/adding-missing-rows — Thomas, May 28 '13 at 08:33
I actually found that earlier and must have closed the tab by accident before I could read it! I'll have a play with what they suggest. Thanks! :) — James A, May 28 '13 at 08:51
@Thomas If I load my data as follows: `> # set Home directory > home = setwd(Sys.getenv("HOME")); > > # make path to the csv file > fpath = file.path(home, "Desktop", "at0901.csv"); > # read the csv file > at0901 = read.csv(fpath, header=TRUE);` I then try to convert the 'timestamp' column to POSIXct values as per the post recommended, but get the following: `ts$timestamp <- as.POSIXct(ts$timestamp, format="%m/%d/%y %H:%M") Error in ts$timestamp : object of type 'closure' is not subsettable` Apologies for my cluelessness - I have been thrown this project with very prior coding knowledge! — James A, May 28 '13 at 11:05

score 50 · Answer 1 · answered Jul 17 '15 at 20:53

This is an old question, but I just wanted to post a dplyr way of handling this, as I came across this post while searching for an answer to a similar problem. I find it more intuitive and easier on the eyes than the zoo approach.

library(dplyr)

ts <- seq.POSIXt(as.POSIXct("2001-09-01 0:00",'%m/%d/%y %H:%M'), as.POSIXct("2001-09-01 0:07",'%m/%d/%y %H:%M'), by="min")

ts <- seq.POSIXt(as.POSIXlt("2001-09-01 0:00"), as.POSIXlt("2001-09-01 0:07"), by="min")
ts <- format.POSIXct(ts,'%m/%d/%y %H:%M')

df <- data.frame(timestamp=ts)

data_with_missing_times <- full_join(df,original_data)

   timestamp     tr tt sr st
1 09/01/01 00:00 15 15 78 42
2 09/01/01 00:01 20 64 98 87
3 09/01/01 00:02 31 84 23 35
4 09/01/01 00:03 21 63 54 20
5 09/01/01 00:04 15 23 36 15
6 09/01/01 00:05 NA NA NA NA
7 09/01/01 00:06 NA NA NA NA
8 09/01/01 00:07 NA NA NA NA

Also using dplyr, this makes it easier to do something like change all those missing values to something else, which came in handy for me when plotting in ggplot.

data_with_missing_times %>% group_by(timestamp) %>% mutate_each(funs(ifelse(is.na(.),0,.)))

   timestamp     tr tt sr st
1 09/01/01 00:00 15 15 78 42
2 09/01/01 00:01 20 64 98 87
3 09/01/01 00:02 31 84 23 35
4 09/01/01 00:03 21 63 54 20
5 09/01/01 00:04 15 23 36 15
6 09/01/01 00:05  0  0  0  0
7 09/01/01 00:06  0  0  0  0
8 09/01/01 00:07  0  0  0  0

can explain me clearly on the first 3 lines of the code? How are you doing that? I want to do that for different dates for example from 1st March 2017 to 30 September 2017. I'm kind of confused. So, asking you explanation. Thanks in advance. — Toros91, Mar 14 '18 at 08:15
Using the lubridate package, you can generate the Dates in a more intuitive fashion as well. In this code, I demonstrate date generation without the minute and seconds: `package(lubridate); ts <- seq(ymd("1991-01-01"), ymd("2015-12-31"), by="day")` — K Bro, Mar 19 '18 at 17:30
Hi, I am having a similar problem except that I have another variable in front of time stamp called Product Id and this variable repeats the whole timestamp value for each set. I am trying to group my dataset using group_by and then use the timestamp completion function discussed above, it doesnt seem to work. Can you please suggest, what should be done? — user10579790, Jun 06 '19 at 09:35

score 29 · Answer 2 · answered Jan 20 '17 at 15:41

29

Date padding is implemented in the padr package in R. If you store your data frame, with your date-time variable stored as POSIXct or POSIXlt. All you need to do is:

library(padr)
pad(df_name)

See vignette("padr") or this blog post for its working.

answered Jan 20 '17 at 15:41

Edwin

3,184
1
23
25

Thanks @edwin. I had a little trouble with the function not appearing as part of the autocomplete in rstudio. It's fixed now. Thanks for your reply. – Dan Apr 11 '17 at 12:46
Wow, this seems to be a pretty convenient solution. Unfortunately I get "Error in if (!all(dt_var[1:(length(dt_var) - 1)] <= dt_var[2:length(dt_var)])) { : missing value where TRUE/FALSE needed" – smudo78 Apr 11 '17 at 13:43
Could you maybe share an example producing this error? Please use the package's github: edwinth/padr – Edwin Apr 11 '17 at 14:31

score 28 · Accepted Answer · answered May 28 '13 at 12:07

28

I think the easiest thing ist to set Date first as already described, convert to zoo, and then just set a merge:

df$timestamp<-as.POSIXct(df$timestamp,format="%m/%d/%y %H:%M")

df1.zoo<-zoo(df[,-1],df[,1]) #set date to Index

df2 <- merge(df1.zoo,zoo(,seq(start(df1.zoo),end(df1.zoo),by="min")), all=TRUE)

Start and end are given from your df1 (original data) and you are setting by - e.g min - as you need for your example. all=TRUE sets all missing values at the missing dates to NAs.

answered May 28 '13 at 12:07

Herr Student

853
14
26

This worked for me - I now have a complete time series with NAs for the missing Y values. Thank you so much! – James A May 29 '13 at 06:03
I get 2 hours difference between the starting row at df1.zoo and df2. df2 starts 2 hours later! Any idea? – Rotail Sep 07 '16 at 21:08
I get ``Error in del/by : non-numeric argument to binary operator`` when I do this. Any ideas why? – jblakeley Dec 05 '17 at 20:47

score 19 · Answer 4 · answered Apr 26 '19 at 19:34

19

I think this can accomplished by using complete in tidyr package.

library(tidyverse)
df <- df %>%
      complete(timestamp = seq.POSIXt(min(timestamp), max(timestamp), by = "minute"), 
               tr, tt, sr,st)

you can also initialize your start date and end date instead of using min(timestamp) and max(timestamp).

answered Apr 26 '19 at 19:34

ok1more

779
6
15

This answer is super helpful, thanks! When using this, have you ever received the error "Long vectors are not yet supported. Requested output size must be less than 2147483647"? I can't seem to work around this, and I'm only working with a subset of my data as is (to speed up processing until I get the code right). – cgxytf Oct 20 '21 at 17:34
I have not yet seen this error, if you can share an example of what you are trying to do, may be it could be looked in to be experts. – ok1more Dec 09 '21 at 18:53

score 2 · Answer 5 · answered May 28 '13 at 11:20

2

# some made-up data
originaldf <- data.frame(timestamp=c("9/1/01 0:00","9/1/01 0:01","9/1/01 0:03","9/1/01 0:04"),
    tr = rnorm(4,0,1),
    tt = rnorm(4,0,1))

originaldf$minAsPOSIX <- as.POSIXct(originaldf$timestamp, format="%m/%d/%y %H:%M", tz="GMT")

# Generate vector of all minutes
ndays <- 1 # number of days to generate
minAsNumeric <- 60*60*24*243 + seq(0,60*60*24*ndays,by=60)

# convert those minutes to POSIX
minAsPOSIX <- as.POSIXct(minAsNumeric, origin="2001-01-01", tz="GMT")

# new df
newdf <- merge(data.frame(minAsPOSIX),originaldf,all.x=TRUE, by="minAsPOSIX")

answered May 28 '13 at 11:20

Thomas

43,637
12
109
140

Superb, let me dig through that and see if I can understand what's going on and how I can modify it to do what I need. Looks clear (relatively speaking!). Thank you! – James A May 28 '13 at 11:47
Great! If you find the answer helpful, I'd appreciate you accepting the answer by checking the gray mark to the left. All the best, -Thomas – Thomas May 28 '13 at 11:57
This looks like it will work - no errors, just a few issues. I actually messed up the date format in my original post - it should be `%y-%m-%d %H:%M`. Regardless, I've edited the code to account for this, and the new timestamps are coming out a little wrong. For example, using data for 2009-01-01 to 2009-01-31 (January 1st to 31st 2009), I get the following as the result in 'newdf': `1 2009-09-01 00:00:00 NA NA NA NA` - in other words, it is getting the timestamps confused (setting month from 01 to 09) and filling all of the columns with NA even when Y values should be available. – James A May 29 '13 at 05:29
The code I'm using (trying to process data for all of January 2009) is: `# set timestamps as POSIXct values at0901mod2$minAsPOSIX <- as.POSIXct(at0901mod2$timestamp, format="%y-%m-%d %H:%M", tz="") # Generate vector of all minutes ndays <- 31 # number of days to generate minAsNumeric <- 60*60*24*243 + seq(0,60*60*24*ndays,by=60) # convert those minutes to POSIX minAsPOSIX <- as.POSIXct(minAsNumeric, origin="2001-01-01", tz="") # new df newdf <- merge(data.frame(minAsPOSIX),at0901mod2,all.x=TRUE, by="minAsPOSIX")` – James A May 29 '13 at 05:55

score 0 · Answer 6 · answered Feb 16 '16 at 17:32

In case you want to substitute the NA values acquired by any method mentioned above with zeroes, you can do this:

df[is.na(df)] <- 0

(I orginally wanted to comment this on Ibollar's answer but I lack the necessary reputation, thus I posted as an answer)

score 0 · Answer 7 · answered Sep 07 '16 at 21:45

df1.zoo <- zoo(df1[,-1], as.POSIXlt(df1[,1], format = "%Y-%m-%d %H:%M:%S")) #set date to Index: Notice that column 1 is Timestamp type and is named as "TS"

full.frame.zoo <- zoo(NA, seq(start(df1.zoo), end(df1.zoo), by="min")) # zoo object
full.frame.df  <- data.frame(TS = as.POSIXlt(index(full.frame.zoo), format = "%Y-%m-%d %H:%M:%S")) # conver zoo object to data frame

full.vancouver <- merge(full.frame.df, df1, all = TRUE) # merge

score 0 · Answer 8 · answered May 19 '17 at 06:56

I was looking for something similar where instead of filling out missing timestamps my data was in months and days. So I wanted to generate a sequence of months that would cater for leap years et cetera. I used lubridate:

date <- df$timestamp[1]
date_list <- c(date)
while (date < df$timestamp[nrow(df)]){
    date <- date %m+% months(1) 
    date_list <- c(date_list,date)
}
date_list <- format(as.Date(date_list),"%Y-%m-%d")
df_1 <- data.frame(months=date_list, stringsAsFactors = F)

This will give me a list of dates in incremental months. Then I join

df_with_missing_months <- full_join(df_1,df)

Loops in R are hardly the right answer..I'm sure there's a much more efficient way of doing this. — RLave, Sep 30 '21 at 15:17

score 0 · Answer 9 · answered Mar 26 '19 at 09:23

There are some advances in handling time series data in R, e.g. the tsibble package added such time series manipulations in tidy way:

library(tsibble)
library(lubridate)

ts <- lubridate::dmy_hm(c("9/1/01 0:00","9/1/01 0:01","9/1/01 0:03","9/1/01 0:27"))
originaldf <- tsibble(timestamp = ts,
                      tr        = rnorm(4,0,1),
                      tt        = rnorm(4,0,1),
                      index     = timestamp)

originaldf %>% 
  fill_gaps()

Insert rows for missing dates/times

9 Answers9

Linked

Related