
I have a data set with IDs, begin date, end date, and activity status. The current data set contains 150,000 rows and 50,000 IDs, but it will grow much bigger.

Using R, I would like to regroup, onto a single row per ID and period (begin date and end date), the different activities occurring over an overlapping period.

In other words, I am trying to: 1) reshape the data so that overlapping and non-overlapping periods end up in separate rows; 2) paste together the activity statuses of overlapping periods.

I already tried:

Find overlapping dates for each ID and create a new row for the overlap

and

Convert data of overlapping time ranges to data of distinct time ranges

Both of these solutions are very good and helped me a lot.

However, neither of these two approaches can handle the whole 150,000-row data set in one pass. I used the second solution by splitting my data set; it took 2 hours to process 18,000 rows.

I wonder if a more efficient solution exists in R. I was thinking of using mutate, but I'm not very familiar with that function.

The input data looks like this:

ID  START   END STATUS  
5c0e83  2013-11-01  2015-01-01  P1  
5c0e83  2002-09-01  2003-09-01  F2  
5c0e83  2016-09-01  2016-09-01  F0  
5c0e83  2006-01-01  2007-01-01  P3  
5c0e83  2003-11-01  2013-11-01  P2  
5c0e83  2015-09-01  2018-01-01  P0  
5c0e83  2003-09-01  2005-09-01  F1  
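For reproducibility, the sample above can be built like this (the data frame name `df` is my assumption; the columns match the table):

```r
# Sample data matching the input table above ("df" is an assumed name)
df <- data.frame(
  ID     = rep("5c0e83", 7),
  START  = as.Date(c("2013-11-01", "2002-09-01", "2016-09-01", "2006-01-01",
                     "2003-11-01", "2015-09-01", "2003-09-01")),
  END    = as.Date(c("2015-01-01", "2003-09-01", "2016-09-01", "2007-01-01",
                     "2013-11-01", "2018-01-01", "2005-09-01")),
  STATUS = c("P1", "F2", "F0", "P3", "P2", "P0", "F1"),
  stringsAsFactors = FALSE
)
```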

Thanks to the script previously cited (Convert data of overlapping time ranges to data of distinct time ranges), the output looks like this:

ID  STATUS  START   END
5c0e83  F3  2002-09-01  2002-09-01
5c0e83  F3  2002-09-01  2002-10-01
5c0e83  F3  2002-10-01  2002-10-01
5c0e83  F3  2002-10-01  2002-11-01
5c0e83  F3  2002-11-01  2002-11-01
5c0e83  F3  2002-11-01  2003-01-01
5c0e83  F3  2003-01-01  2003-01-01
5c0e83  F3  2003-01-01  2003-09-01
5c0e83  F3, F2  2003-09-01  2003-09-01
5c0e83  F2  2003-09-01  2003-10-01
5c0e83  F2  2003-10-01  2003-10-01
5c0e83  F2  2003-10-01  2003-11-01
5c0e83  F2, P2  2003-11-01  2003-11-01
5c0e83  F2, P2  2003-11-01  2004-01-01
5c0e83  F2, P2  2004-01-01  2004-01-01
5c0e83  F2, P2  2004-01-01  2004-09-01
5c0e83  F2, P2  2004-09-01  2004-09-01
5c0e83  F2, P2  2004-09-01  2004-10-01
5c0e83  F2, P2  2004-10-01  2004-10-01
5c0e83  F2, P2  2004-10-01  2004-11-01

The result is correct, but the processing time is very long. Does anyone know a more efficient way to perform this task?

Thanks for your help

2 Answers


This is a little faster than the solution you linked to on the sample data, but I hope significantly faster on your larger data.

My approach was to bring the data into long form, pad each ID-STATUS pair to get a row for every month in its range, compute the status combinations for each ID-date, then extract the START and END of each combination and spread the data wide again.

library(tidyverse)
df2 <- df %>%
  gather(type, date, START, END) %>%
  group_by(ID, STATUS) %>%
  padr::pad(group = c("ID", "STATUS"), interval = "1 month") %>%
  distinct(ID, STATUS, date) %>%  # For statuses that only last 1 month, only need one row, not both

  # Adapted from https://stackoverflow.com/questions/52010252/convert-data-of-overlapping-time-ranges-to-data-of-distinct-time-ranges/52011136#52011136
  group_by(ID, date) %>%
  summarize(STATUS = paste(unique(STATUS), collapse = ", ")) %>%

  mutate(type = case_when(row_number() == 1   | STATUS != lag(STATUS)  ~ "START",
                          row_number() == n() | STATUS != lead(STATUS) ~ "END",
                          TRUE  ~ NA_character_)) %>%
  filter(!is.na(type)) %>%

  # This last part is to reshape the data to be wide again, with the first line
  #   to keep re-introduced combinations distinct from prior occurrences
  group_by(ID, STATUS) %>% mutate(cycle = cumsum(type == "START")) %>% ungroup() %>%
  spread(type, date) %>%
  mutate(END = if_else(is.na(END), START, END)) 
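
The `cycle = cumsum(type == "START")` step is what keeps re-introduced combinations distinct: every new START for an ID-STATUS pair opens a new cycle, so `spread()` never collides on duplicate keys. A minimal base-R sketch of that idea:

```r
# An ID-STATUS combination that appears, ends, then appears again later
type <- c("START", "END", "START", "END")

# Each START bumps the counter, so the two occurrences of the
# combination get distinct cycle ids (1 and 2)
cycle <- cumsum(type == "START")
cycle  # 1 1 2 2
```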
Jon Spring

Thanks for your help.

Your solution provides a huge time gain: from 2 hours down to a few seconds to process 12,000 rows.

However, when I load a bigger data frame (150,000 rows), I get the following error:

Error: Estimated 3040980.62465753 returned rows, larger than 1 million in break_above

Traceback:

 1. df_ml2 %>% gather(type, date, START, END) %>%
      group_by(ID, STATUS) %>%
      padr::pad(group = c("ID", "STATUS"), interval = "1 month") %>%
      distinct(ID, STATUS, date) %>%
      group_by(ID, date) %>%
      summarize(STATUS = paste(unique(STATUS), collapse = ", ")) %>%
      mutate(type = case_when(row_number() == 1   | STATUS != lag(STATUS)  ~ "START",
                              row_number() == n() | STATUS != lead(STATUS) ~ "END",
                              TRUE ~ NA_character_)) %>%
      filter(!is.na(type)) %>%
      group_by(ID, STATUS) %>%
      mutate(cycle = cumsum(type == "START")) %>%
      ungroup() %>%
      spread(type, date) %>%
      mutate(END = if_else(is.na(END), START, END))
 2. withVisible(eval(quote(`_fseq`(`_lhs`)), env, env))
 3. eval(quote(`_fseq`(`_lhs`)), env, env)
 4. eval(quote(`_fseq`(`_lhs`)), env, env)
 5. `_fseq`(`_lhs`)
 6. freduce(value, `_function_list`)
 7. function_list[[i]]
 8. padr::pad(., group = c("ID", "STATUS"), interval = "1 month")
 9. break_above_func(return_rows, break_above)
10. stop(sprintf("Estimated %s returned rows, larger than %s million in break_above", n, threshold), call. = FALSE)

Does the solution consist of applying the script to a list of sub-datasets, or is there a way to go beyond the one-million break_above limit?
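
As the error message itself suggests, `padr::pad()` has a `break_above` argument (a safety threshold in millions of estimated rows) that you can raise, e.g. `padr::pad(group = c("ID", "STATUS"), interval = "1 month", break_above = 4)`. Alternatively, since IDs are independent of each other, you can split the data into chunks of IDs and run the pipeline per chunk. A base-R sketch of that chunking (`process_chunk` is a hypothetical placeholder for the pipeline above):

```r
# Split rows into chunks of roughly `chunk_ids` IDs each, process each
# chunk independently, then recombine the results.
# process_chunk is a placeholder for the tidyverse pipeline above.
chunked_run <- function(df, chunk_ids = 5000, process_chunk = identity) {
  ids    <- unique(df$ID)
  groups <- ceiling(seq_along(ids) / chunk_ids)    # chunk index per unique ID
  chunks <- split(df, groups[match(df$ID, ids)])   # rows grouped by chunk
  do.call(rbind, lapply(chunks, process_chunk))    # process and recombine
}
```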

Thanks again for your help.