4

I have a data set with time periods, that may overlap, showing me if somebody was present (example_df). I want to get a data set that splits a large time period (from 2014-01-01 to 2014-10-31) into smaller time periods where somebody was present (present = 1) and time periods where nobody was present (present = 0). The result should look like result_df

Example data frame

example_df <- data.frame(ID = 1, 
                     start = c(as.Date("2014-01-01"), as.Date("2014-03-05"), as.Date("2014-06-13"), as.Date("2014-08-15")), 
                     end = c(as.Date("2014-04-07"), as.Date("2014-04-12"), as.Date("2014-08-05"), as.Date("2014-10-02")), 
                     present = 1) 

Result should look like this

result_df <- data.frame(ID = 1, 
                     start = c(as.Date("2014-01-01"), as.Date("2014-04-12"), as.Date("2014-06-13"), as.Date("2014-08-05"), as.Date("2014-08-15"), as.Date("2014-10-02")), 
                     end = c(as.Date("2014-04-12"), as.Date("2014-06-13"), as.Date("2014-08-05"), as.Date("2014-08-15"), as.Date("2014-10-02"), as.Date("2014-10-31")), 
                     present = c(1, 0, 1, 0, 1, 0)) 

I have no idea how to tackle this problem as it requires to split time periods or add rows (or something else?). Any help is much appreciated!

adibender
  • 7,288
  • 3
  • 37
  • 41
user138089
  • 99
  • 1
  • 5

2 Answers2

3

I hope I can be helpful, as I have struggled with this as well.

As in IceCreamToucan's example, this assumes independence by person ID. This approach uses dplyr to look at overlap in date ranges and then flattens them. Other examples of this approach have been described in stackoverflow and use dplyr. The end result includes time ranges where the person is present.

library(tidyr)
library(dplyr)

pres <- example_df %>%
  group_by(ID) %>%
  arrange(start) %>% 
  mutate(indx = c(0, cumsum(as.numeric(lead(start)) > cummax(as.numeric(end)))[-n()])) %>%
  group_by(ID, indx) %>%
  summarise(start = min(start), end = max(end), present = 1) %>%
  select(-indx)

Then, additional rows can be added to indicate time period when not present. In these cases, for a given ID, it will determine gaps between an older end date and a newer (more recent) start date. Then finally the result is ordered by ID and the start date.

result <- pres

for (i in unique(pres$ID)) {
  pres_i <- subset(pres, ID == i)
  if (nrow(pres_i) > 1) {
    adding <- data.frame(ID = i, start = pres_i$end[-nrow(pres_i)]+1, end = pres_i$start[-1]-1, present = 0)
    adding <- adding[adding$start <= adding$end, ]
    result <- bind_rows(result, adding)
  }
}
result[order(result$ID, result$start), ]

# A tibble: 5 x 4
# Groups:   ID [1]
     ID start      end        present
  <dbl> <date>     <date>       <dbl>
1     1 2014-01-01 2014-04-12       1
2     1 2014-04-13 2014-06-12       0
3     1 2014-06-13 2014-08-05       1
4     1 2014-08-06 2014-08-14       0
5     1 2014-08-15 2014-10-02       1
Ben
  • 28,684
  • 5
  • 23
  • 45
1

Assuming you want to do it separately for each ID, you can create a data table with all dates for which someone was present, and join that with a table of all dates over that time period. The result is not exactly the same, because the present and not-present periods don't overlap.

library(data.table)
setDT(example_df)


example_df[, {
  pres <- unique(unlist(Map(`:`, start, end)))
  class(pres) <- 'Date'
  all <- min(pres):max(pres)
  class(all) <- 'Date'
  pres <- data.table(day = pres)
  all <- data.table(day = all)
  out.full <- pres[all, on = .(day), .(day = i.day, present = +!is.na(x.day))]
  out.full[, .(start = min(day), end = max(day)), 
           by = .(present, rid = rleid(present))][, -'rid']
  }, by = ID]

#    ID present      start        end
# 1:  1       1 2014-01-01 2014-04-12
# 2:  1       0 2014-04-13 2014-06-12
# 3:  1       1 2014-06-13 2014-08-05
# 4:  1       0 2014-08-06 2014-08-14
# 5:  1       1 2014-08-15 2014-10-02
IceCreamToucan
  • 28,083
  • 2
  • 22
  • 38