0

I have a data.table with two parameters(date and statues), now I want to insert new rows for the whole day based on the original table.

data rules:

  1. the Status column contains only "0" and "1"
  2. the time-stamp which hasnt been recorded in the table is same with its next nearest tiemstamp in the table.
  3. the Date column is always increase by time :)

For example, a simple input:

enter image description here

create the data with the code below:

   dd <- data.table(date = c("2015-07-01 00:00:02", "2015-07-01 00:00:04", "2015-07-01 00:00:08"),
             status = c(0,1,0))

the out put is :

enter image description here

my solution:

  1. calculate the time difference for each two rows, then save in a new column called time_diff
  2. insert new row in the loop based on the time_diff

it could work, but the problem is the calculation time is too long, since so many loops. I thinks there could be an easier solution for this case

Any help or suggestion would be greatly appreciated:)

thanks!

ZAWD
  • 651
  • 7
  • 31
  • In your description, 7/1/2015 0:00:05 has status 1, too, because its nearest time-stamp is 0:00:04 and it has 1. Or should it just be the time-step before? – J_F Oct 24 '16 at 13:52
  • thanks for the question, but 7/1/2015 0:00:05 has not been recorded in the input table, and its next nearest timestamp is 7/1/2015 0:00:08 based on the input table, so 7/1/2015 0:00:05 has status 0 :) – ZAWD Oct 24 '16 at 13:58

2 Answers2

3

Here's another idea

library(dplyr)
library(tidyr)
library(lubridate)

dd %>%
  mutate(date = ymd_hms(date)) %>%
  complete(date = seq(floor_date(min(date), "day"), max(date), 1)) %>%
  fill(status, .direction = "up")

Which gives:

## A tibble: 9 × 2
#                 date status
#               <dttm>  <dbl>
#1 2015-07-01 00:00:00      0
#2 2015-07-01 00:00:01      0
#3 2015-07-01 00:00:02      0
#4 2015-07-01 00:00:03      1
#5 2015-07-01 00:00:04      1
#6 2015-07-01 00:00:05      0
#7 2015-07-01 00:00:06      0
#8 2015-07-01 00:00:07      0
#9 2015-07-01 00:00:08      0
Steven Beaupré
  • 21,343
  • 7
  • 57
  • 77
1

One way to do this quicker is to use zoo and merge two time series (as shown in the SO answer):

  1. The first is constructed from your data.
  2. The second is a time series with observations that starts and ends as you desire.

Then fill in the NAs from the merge using na.locf. In code:

## first convert your date column to date-time
dd$date <- as.POSIXct(dd$date,format="%Y-%m-%d %H:%M:%S")
## set dd as data frame
setDF(dd)
library(zoo)
## construct zoo time series for your data
dd.zoo <- zoo(dd[,-1],dd[,1])
## do the merge and use `na.locf` to fill in the NA's
output <- na.locf(merge(dd.zoo,
                        zoo(,seq(as.POSIXct("2015-07-01 00:00:00",format="%Y-%m-%d %H:%M:%S"),
                                 end(dd.zoo),by="sec")), all=TRUE),
                  fromLast=TRUE)

Here, the start and end of the time series to merge is 2015-07-01 00:00:00 to the end of your data by seconds. In general you can specify any two points in time. The merge uses all=TRUE to perform an outer join where all observations are joined in the output. Those not in the original data will be filled with NA. Finally, use na.locf with fromLast=TRUE to replace the NA's with the most recent non-NA backwards from the last observation.

Using your data with dd converted to a data frame:

print(output)
##2015-07-01 00:00:00 2015-07-01 00:00:01 2015-07-01 00:00:02 2015-07-01 00:00:03 
##                  0                   0                   0                   1 
##2015-07-01 00:00:04 2015-07-01 00:00:05 2015-07-01 00:00:06 2015-07-01 00:00:07 
##                  1                   0                   0                   0 
##2015-07-01 00:00:08 
##                  0 

Note that the output is a zoo series. To convert back to a data.table:

output <- data.table(date=index(output),status=as.data.frame(output)$output)
##                  date status
##1: 2015-07-01 00:00:00      0
##2: 2015-07-01 00:00:01      0
##3: 2015-07-01 00:00:02      0
##4: 2015-07-01 00:00:03      1
##5: 2015-07-01 00:00:04      1
##6: 2015-07-01 00:00:05      0
##7: 2015-07-01 00:00:06      0
##8: 2015-07-01 00:00:07      0
##9: 2015-07-01 00:00:08      0
Community
  • 1
  • 1
aichao
  • 7,375
  • 3
  • 16
  • 18
  • hi, thanks for the answer, but i got the error for it : Error in seq.POSIXt(as.POSIXct("2015-07-01 00:00:00", format = "%Y-%m-%d %H:%M:%S"), : 'to' must be a "POSIXt" object – ZAWD Oct 24 '16 at 14:11
  • @ZAWD: see my edit, you need to convert your data to a data frame first. – aichao Oct 24 '16 at 14:12
  • Hi, solved the problem :) thank you so much for the details – ZAWD Oct 24 '16 at 20:33