1

I have a dataset that looks like this:

id      land    datetime
pb1     0       2004-04-05 01:44:00
pb1     1       2004-04-05 02:00:00
pb1     1       2004-04-05 16:00:00 
pb2     1       2004-04-05 18:01:00 
pb2     1       2004-04-05 20:00:00   

library(data.table) 
DT = data.table(
  id = c("pb1", "pb1", "pb1", "pb2", "pb2"), 
  land = c(0L, 1L, 1L, 1L, 1L), 
  datetime = sprintf("2004-04-05 %02d:%02d:00", 
                     c(1, 2, 16, 18, 20), 
                     c(44, 0, 0, 1, 0))
)

I would like to make a column that cumulatively adds time (in days) but ONLY if there is a '1' in the land column. I also would like the count to reset when the id changes.

I have tried a variety of methods using data.table, rleid, and even a nested for loop with no success. I have gotten errors using code like this:

DT[, total :=land*diff(as.numeric(datetime)), .(id, rleid(land))]

I have tried variations of the solution here: Calculating cumulative time in R

I'm not sure the best way to calculate the time interval (no success with difftime or lubridate).

I want the end result to look like this:

id      land           datetime         cumtime.land
pb1     0       2004-04-05 01:44:00     0
pb1     1       2004-04-05 02:00:00     0
pb1     1       2004-04-06 16:00:00     1.58333
pb2     1       2004-04-05 18:00:00     0
pb2     1       2004-04-05 20:00:00     0.08333
MichaelChirico
  • 33,841
  • 14
  • 113
  • 198
NorthLattitude
  • 201
  • 1
  • 12

2 Answers2

1

I could not replicate @Japp's comment, but you can easily do this with dplyr.

Depending on what your exact expected output is, you could stop before the summarize call:

library(dplyr)
df=read.table(text=
    "id      land    datetime
    pb1     0       '2004-04-05 01:44:00'
    pb1     1       '2004-04-05 02:00:00'
    pb1     1       '2004-04-06 16:00:00'
    pb1     1       '2004-04-07 16:00:00'
    pb2     1       '2004-04-05 18:00:00' 
    pb2     1       '2004-04-05 20:00:00'", header=T) %>% 
  mutate(datetime=as.POSIXct(datetime,format='%Y-%m-%d %H:%M:%S'))

x = df %>% 
  group_by(id) %>% 
  arrange(id, datetime) %>% 
  mutate(time.land=ifelse(land==0 | is.na(lag(land)) | lag(land)==0, 
                             0,
                             difftime(datetime, lag(datetime), units="days"))) %>% 
  mutate(cumtime.land=time.land + ifelse(is.na(lag(time.land)), 0, lag(time.land)))

  id     land datetime            time.land cumtime.land
  <fct> <int> <dttm>                  <dbl>        <dbl>
1 pb1       0 2004-04-05 01:44:00    0            0     
2 pb1       1 2004-04-05 02:00:00    0            0     
3 pb1       1 2004-04-06 16:00:00    1.58         1.58  
4 pb1       1 2004-04-07 16:00:00    1            2.58  
5 pb2       1 2004-04-05 18:00:00    0            0     
6 pb2       1 2004-04-05 20:00:00    0.0833       0.0833

The key is to use the dplyr::lag() function which takes the "line just above" in the table (which implies that you have to arrange() it beforehand).

By wrapping this inside the ifelse, I'm checking that land and previous land were not 0 (and that we are not in the first line of the id, or lag(anything) will be missing).

I then just reuse the lag() function to get the cumtime.land variable.

Dan Chaltiel
  • 7,811
  • 5
  • 47
  • 92
0

I believe you're after:

DT[land == 1, cumtime.land = 
     cumsum(c(0, diff(as.numeric(datetime))))/86400, by = id]

as.numeric(datetime) converts it to seconds so we use 86400 to convert to days.

Somewhat more "official" in the sense of leveraging time/date classes directly is to use difftime and shift:

DT[land == 1, by = id,
   cumtime.land := 
     cumsum(as.double(difftime(
       datetime, shift(datetime, fill = datetime[1L]), units = 'days'
     )))]

I switched the order of the by argument simply to help with formatting.

We use datetime[1L] to fill so that the initial difference is 0; we need as.double because cumsum errors as it's not confident how to deal with difftime objects as input.

See also:

MichaelChirico
  • 33,841
  • 14
  • 113
  • 198