1

I have a dataset with periods

active <- data.table(id=c(1,1,2,3), beg=as.POSIXct(c("2018-01-01 01:10:00","2018-01-01 01:50:00","2018-01-01 01:50:00","2018-01-01 01:50:00")), end=as.POSIXct(c("2018-01-01 01:20:00","2018-01-01 02:00:00","2018-01-01 02:00:00","2018-01-01 02:00:00")))
> active
   id                 beg                 end 
1:  1 2018-01-01 01:10:00 2018-01-01 01:20:00 
2:  1 2018-01-01 01:50:00 2018-01-01 02:00:00    
3:  2 2018-01-01 01:50:00 2018-01-01 02:00:00    
4:  3 2018-01-01 01:50:00 2018-01-01 02:00:00    

during which an id was active. I would like to aggregate across ids and determine for every point in

time <- data.table(seq(from=min(active$beg),to=max(active$end),by="mins"))

the number of IDs that are inactive and the average number of minutes until they get active. That is, ideally, the table looks like

>ans
                   time  inactive av.time
 1: 2018-01-01 01:10:00         2      30
 2: 2018-01-01 01:11:00         2      29
...
50: 2018-01-01 02:00:00         0       0

I believe this can be done using data.table but I cannot figure out the syntax to get the time differences.

bumblebee
  • 1,116
  • 8
  • 20

1 Answers1

0

Using dplyr, we can join by a dummy variable to create the Cartesian product of time and active. The definitions of inactive and av.time might not be exactly what you're looking for, but it should get you started. If your data is very large, I agree that data.table will be a better way of handling this.

library(tidyverse)

time %>% 
  mutate(dummy = TRUE) %>% 
  inner_join({
    active %>% 
      mutate(dummy = TRUE)
    #join by the dummy variable to get the Cartesian product
  }, by = c("dummy" = "dummy")) %>% 
  select(-dummy) %>% 
  #define what makes an id inactive and the time until it becomes active
  mutate(inactive = time < beg | time > end,
         TimeUntilActive = ifelse(beg > time, difftime(beg, time, units = "mins"), NA)) %>% 
  #group by time and summarise
  group_by(time) %>% 
  summarise(inactive = sum(inactive),
            av.time = mean(TimeUntilActive, na.rm = TRUE))

# A tibble: 51 x 3
        time            inactive av.time
        <dttm>            <int>   <dbl>
1 2018-01-01 01:10:00        3      40
2 2018-01-01 01:11:00        3      39
3 2018-01-01 01:12:00        3      38
4 2018-01-01 01:13:00        3      37
5 2018-01-01 01:14:00        3      36
6 2018-01-01 01:15:00        3      35
7 2018-01-01 01:16:00        3      34
8 2018-01-01 01:17:00        3      33
9 2018-01-01 01:18:00        3      32
10 2018-01-01 01:19:00        3      31
Jordo82
  • 796
  • 4
  • 14
  • Thanks -- the dummy solution is ineed interesting and I will look into this. However, the data is indeed quite large, spanning several years and several thousand ids. If anyone has a `data.table`solution this would help greatly. – bumblebee Nov 27 '18 at 14:25