1

I have data listing start and end dates for different products for a big number of users. The intervals for different products can overlap or have time gaps between purchases:

user_id start_date  end_date    product
    12  31/10/2010  31/10/2011  A
    12  18/12/2010  18/12/2011  A
    12  31/10/2011  28/04/2014  B
    12  18/12/2011  18/12/2014  A
    12  27/03/2014  27/03/2015  A
    12  18/12/2014  18/12/2016  B
    12  27/03/2015  27/03/2016  B
    12  18/12/2016  18/12/2017  D
    33  01/07/1992  01/07/2016  A
    33  20/08/1993  16/08/2016  B
    33  28/10/1999  15/11/2012  A
    33  31/01/2006  28/02/2006  B
    33  26/08/2016  26/01/2017  C

I would like to obtain the number of days of overlaps of all potential combinations of products for each patient.

user_id A_B       A_C   A_D      B_C    B_D      C_D
12      20 days 0 days  10 days 0 days  0 days  0 days
33      10 days 0 days  0 days  0 days  20 days 20 days
                    

Is there a quick and elegant way to code it, hopefully in dplyr?

Thank you for your help!

Code:

   library(lubridate)
    library(Hmisc)
    library(dplyr)

user_id <- c(rep(12, 8), rep(33, 5))

start_date <- dmy(Cs(31/10/2010,    18/12/2010, 31/10/2011, 18/12/2011, 27/03/2014, 18/12/2014, 27/03/2015, 18/12/2016, 01/07/1992, 20/08/1993, 28/10/1999, 31/01/2006, 26/08/2016))

end_date <- dmy(Cs(31/10/2011,  18/12/2011, 28/04/2014, 18/12/2014, 27/03/2015, 18/12/2016, 27/03/2016, 18/12/2017,
               01/07/2016,  16/08/2016, 15/11/2012, 28/02/2006, 26/01/2017))


 product <- c("A", "A","B","A","A","B","B","D","A","B","A","B", "C")


data <- data.frame(user_id, start_date, end_date, product )
  • Does this answer your question? [Split overlapping date time intervals into non-overlapping intervals, within values of an identifier](https://stackoverflow.com/questions/66396232/split-overlapping-date-time-intervals-into-non-overlapping-intervals-within-val) – M_1 Sep 16 '21 at 15:33
  • You have multiple `A` products in the first group, I would like to know when you calculate the overlapping days for `A-B` which one you are referring to? – Anoushiravan R Sep 16 '21 at 18:34
  • MarcBP: it is a different task, but thank you very much it was still relevant to read Anoushiravan R: sum of all overlaps between A-B – Maurizio Sessa Sep 20 '21 at 09:29

1 Answers1

0

Here is the solution. First, we create the appropriate data table. Note, I have slightly modified your data.

library(tidyverse)
library(lubridate)

df = read.table(
  header = TRUE,text="
user_id start_date  end_date    product
    12  31/10/2010  31/10/2011  A
    12  18/12/2010  18/12/2011  A
    12  31/10/2011  28/04/2014  B
    12  18/12/2011  18/12/2014  A
    12  27/03/2014  27/03/2015  A
    12  18/12/2014  18/12/2016  B
    12  27/03/2015  27/03/2016  B
    12  18/01/2016  18/12/2017  D
    33  01/07/1992  01/07/2016  A
    33  20/08/1993  16/08/2016  B
    33  28/10/1999  15/11/2012  A
    33  31/01/2006  28/02/2006  B
    33  26/08/2006  26/01/2017  C
") %>% as_tibble()

Now we're adding time intervals to the data

df1 = df %>% mutate(
  start_date = start_date %>% dmy(),
  end_date = end_date %>% dmy(),
  product = product %>% fct_infreq(),
  dateint = interval(start_date, end_date)
)

output

# A tibble: 13 x 5
   user_id start_date end_date   product dateint                       
     <int> <date>     <date>     <fct>   <Interval>                    
 1      12 2010-10-31 2011-10-31 A       2010-10-31 UTC--2011-10-31 UTC
 2      12 2010-12-18 2011-12-18 A       2010-12-18 UTC--2011-12-18 UTC
 3      12 2011-10-31 2014-04-28 B       2011-10-31 UTC--2014-04-28 UTC
 4      12 2011-12-18 2014-12-18 A       2011-12-18 UTC--2014-12-18 UTC
 5      12 2014-03-27 2015-03-27 A       2014-03-27 UTC--2015-03-27 UTC
 6      12 2014-12-18 2016-12-18 B       2014-12-18 UTC--2016-12-18 UTC
 7      12 2015-03-27 2016-03-27 B       2015-03-27 UTC--2016-03-27 UTC
 8      12 2016-01-18 2017-12-18 D       2016-01-18 UTC--2017-12-18 UTC
 9      33 1992-07-01 2016-07-01 A       1992-07-01 UTC--2016-07-01 UTC
10      33 1993-08-20 2016-08-16 B       1993-08-20 UTC--2016-08-16 UTC
11      33 1999-10-28 2012-11-15 A       1999-10-28 UTC--2012-11-15 UTC
12      33 2006-01-31 2006-02-28 B       2006-01-31 UTC--2006-02-28 UTC
13      33 2006-08-26 2017-01-26 C       2006-08-26 UTC--2017-01-26 UTC

Now let's create three simple helper functions. The fDayInt function returns the number of days in the common part of two time intervals. The function fSumDayInt returns the sum of the days of overlapping intervals for two products given as parameters. The function fSumComb will return the sum of days for all combinations of products.

fDayInt = function(int1, int2) intersect(int1, int2) %>% 
  as.numeric(.)/(60*60*24)

fSumDayInt = function(df, product1, product2){
  df1A = df %>% 
    filter(product == product1) %>% 
    select(dateint) %>% 
    mutate(join = 1)
  df2B = df %>% 
    filter(product == product2) %>% 
    select(dateint) %>% 
    mutate(join = 1)
  df1A %>% left_join(df2B, by="join") %>% 
    mutate(nday = fDayInt(dateint.x, dateint.y)) %>% 
    summarise(sum.day = sum(nday, na.rm=TRUE)) %>% pull(sum.day)
}

fSumComb = function(df) tibble(
  A_B = df %>% fSumDayInt("A", "B"),
  A_C = df %>% fSumDayInt("A", "C"),
  A_D = df %>% fSumDayInt("A", "D"),
  B_C = df %>% fSumDayInt("B", "C"),
  B_D = df %>% fSumDayInt("B", "D"),
  C_D = df %>% fSumDayInt("C", "D")
)

What we do at the end is child's play!

df1 %>% group_by(user_id) %>% 
  group_modify(~fSumComb(.x))

output

# A tibble: 2 x 7
# Groups:   user_id [2]
  user_id   A_B   A_C   A_D   B_C   B_D   C_D
    <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1      12  1041     0     0     0   404     0
2      33 13174  5870     0  3643     0     0

Hope this is what you expected.

Marek Fiołka
  • 4,825
  • 1
  • 5
  • 20