Calculate number of distinct instances occurring in a given time period

Question

I have dummy data

structure(list(id = c(1, 1, 2, 3, 3, 3, 4, 5, 5, 5, 6, 7, 7, 
7), policy_num = c(41551662L, 50966414L, 43077202L, 46927463L, 
57130236L, 57050065L, 26196559L, 33545119L, 52304024L, 73953064L, 
50340507L, 50491162L, 76577511L, 108067534L), product = c("apple", 
"apple", "pear", "apple", "apple", "apple", "plum", "apple", 
"pear", "apple", "apple", "apple", "pear", "pear"), start_date = 
structure(c(13607, 15434, 14276, 15294, 15660, 15660, 10547, 15117, 15483, 
16351, 15429, 15421, 16474, 17205), class = "Date"), end_date = structure(c(15068, 
16164, 17563, 15660, 15660, 16390, 13834, 16234, 17674, 17447, 
15794, 15786, 17205, 17570), class = "Date")), .Names = c("id", 
"policy_num", "product", "start_date", "end_date"), row.names = c(NA, 
-14L), class = c("data.table", "data.frame"), .internal.selfref = <pointer: 0x0000000000320788>)


id policy_num product start_date   end_date
 1   41551662   apple 2007-04-04 2011-04-04
 1   50966414   apple 2012-04-04 2014-04-04
 2   43077202    pear 2009-02-01 2018-02-01
 3   46927463   apple 2011-11-16 2012-11-16
 3   57130236   apple 2012-11-16 2012-11-16
 3   57050065   apple 2012-11-16 2014-11-16
 4   26196559    plum 1998-11-17 2007-11-17
 5   33545119   apple 2011-05-23 2014-06-13
 5   52304024    pear 2012-05-23 2018-05-23
 5   73953064   apple 2014-10-08 2017-10-08
 6   50340507   apple 2012-03-30 2013-03-30
 7   50491162   apple 2012-03-22 2013-03-22
 7   76577511    pear 2015-02-08 2017-02-08
 7  108067534    pear 2017-02-08 2018-02-08

Based on it, I'd like to calculate the following variables (grouped by user_id):

1) Number of currently held product (no_prod_now) - number of distinct products, whose end_date > currently evaluated start_date. Simply, number of products held by user_id at the time of start_date

2) Number of currently held active policies (no_policies_now) - as above, but applied to policy_num

3) Number of policies opened within 3 months prior the current start_date (policies_open_3mo)

4) policies_closed_3mo - as above, but number of closed policies in the past 3 months

The desirable output would look like this:

 id policy_num product start_date   end_date no_prod_now no_policies_now policies_closed_3mo
  1   41551662   apple 2007-04-04 2011-04-04           1               1                   0
  1   50966414   apple 2012-04-04 2014-04-04           1               1                   0
  2   43077202    pear 2009-02-01 2018-02-01           1               1                   0
  3   46927463   apple 2011-11-16 2012-11-16           1               1                   0
  3   57130236   apple 2012-11-16 2012-11-16           1               1                   1
  3   57050065   apple 2012-11-16 2014-11-16           1               1                   2
  4   26196559    plum 1998-11-17 2007-11-17           1               1                   0
  5   33545119   apple 2011-05-23 2014-06-13           1               1                   0
  5   52304024    pear 2012-05-23 2018-05-23           2               2                   0
  5   73953064   apple 2014-10-08 2017-10-08           2               2                   0
  6   50340507   apple 2012-03-30 2013-03-30           1               1                   0
  7   50491162   apple 2012-03-22 2013-03-22           1               1                   0
  7   76577511    pear 2015-02-08 2017-02-08           1               1                   0
  7  108067534    pear 2017-02-08 2018-02-08           1               1                   1
policies_open_3mo
                0
                0
                0
                0
                0
                1
                0
                0
                1
                0
                0
                0
                0
                0

I'm looking for the solution implemented ideally in data.table, as I'm going to apply it to big data volumes, but base R or dplyr solutions I could always convert to data.table, o would be also valuable, thanks!

When questions involving dates or times, it is helpful to provide a `dput` of your example data. This makes it easier for use to copy and paste the data and have it ready to work on your problem. This can also reveal issues of improperly formatted variables. — lmo, Aug 18 '17 at 11:37

Uwe · Accepted Answer · 2017-08-21T15:36:23.340

This is quite tricky but can be solved with a number of non-equi self-joins.

Edit: It has turned out that update on join doesn't work together with non-equi self-joins as I had expected (see here). So, I had to revise the code completely to avoid updates in place.

Instead, the four additional columns are created by three separate non-equi self-joins and are combined for the final result.

library(data.table)
library(lubridate)

result <- 
  # create helper column for previous three months periods.
  # lubridate's month arithmetic avoids NAs at end of month, e.g., February
  DT[, start_date_3mo := start_date %m-% period(month = 3L)][
    # start "cbind()" with original columns
  , c(.SD, 
      # count number of products and policies held at time of start_date 
      DT[DT, on = c("id", "start_date<=start_date", "end_date>start_date"), 
         .(no_prod_now = uniqueN(product), no_pols_now = uniqueN(policy_num)), 
         by = .EACHI][, c("no_prod_now", "no_pols_now")],
      # policies closed within previous 3 months of start_date
      DT[DT, on = c("id", "end_date>=start_date_3mo", "end_date<=start_date"), 
         .(pols_closed_3mo = .N), by = .EACHI][, "pols_closed_3mo"],
      # additional policies opened within previous 3 months of start_date
      DT[DT, on = c("id", "start_date>=start_date_3mo", "start_date<=start_date"), 
         .(pols_opened_3mo = .N - 1L), by = .EACHI][, "pols_opened_3mo"])][
           # omit helper column
           , -"start_date_3mo"]
result

    id policy_num product start_date   end_date no_prod_now no_pols_now pols_closed_3mo pols_opened_3mo
 1:  1   41551662   apple 2007-04-04 2011-04-04           1           1               0               0
 2:  1   50966414   apple 2012-04-04 2014-04-04           1           1               0               0
 3:  2   43077202    pear 2009-02-01 2018-02-01           1           1               0               0
 4:  3   46927463   apple 2011-11-16 2012-11-16           1           1               0               0
 5:  3   57130236   apple 2012-11-16 2012-11-16           1           1               2               1
 6:  3   57050065   apple 2012-11-16 2014-11-16           1           1               2               1
 7:  4   26196559    plum 1998-11-17 2007-11-17           1           1               0               0
 8:  5   33545119   apple 2011-05-23 2014-06-13           1           1               0               0
 9:  5   52304024    pear 2012-05-23 2018-05-23           2           2               0               0
10:  5   73953064   apple 2014-10-08 2017-10-08           2           2               0               0
11:  6   50340507   apple 2012-03-30 2013-03-30           1           1               0               0
12:  7   50491162   apple 2012-03-22 2013-03-22           1           1               0               0
13:  7   76577511    pear 2015-02-08 2017-02-08           1           1               0               0
14:  7  108067534    pear 2017-02-08 2018-02-08           1           1               1               0

Note that there are discrepancies for policies opened within 3 previous months before start_date between OP's expected result and the result here. For id == 3, there are 2 policies starting both on 2012-11-16, so it's one additional policy to count for each row. For id == 5, all start_date differ by more than 3 months, so there shouldn't be an overlap.

Also, rows 5 and 6 both show a value of 2 for policies closed within 3 previous months before start_date because id == 3 has two policies ending on 2012-11-16.

thanks a lot for that, @Uwe_Block! I didn't have access to my work computer over the weekend. Let me test your solution on my data and get back to you with feedback, thanks! — Kasia Kulma, Aug 21 '17 at 07:43
hi again, just by comparing the desired output with the actual one, there are some discrepancies, e.g. in `no_prod_now` 8th row states 2 products, whereas for id == 5 at 2011-05-23 there was clearly 1. 'Also, `pols_closed_3mo` in row 3 should be 0. as it's id == 3's first ever policy, etc. Just to give you a couple of examples. — Kasia Kulma, Aug 21 '17 at 08:00
hi again, I re-read your commentary and agree that `pols_closed_3mo` should be corrected in the expected outcome example, however, I still don't see how `no_prod_now` for id == 5 works... your code seems sound, am I missing something? — Kasia Kulma, Aug 21 '17 at 13:32
With the completely revised code version, `no_prod_now ` should return the correct values for id == 5. — Uwe, Aug 21 '17 at 15:39
this is a fantastic piece of work, thanks for your time and hard work! — Kasia Kulma, Aug 22 '17 at 07:56
Thank you for your feedback. Indeed, it was quite a challenge, four questions in one ;-) . - To be honest, credit goes to @Frank as well for his [excellent explanation](https://stackoverflow.com/a/45797298/3817004) why update in a join might fail if there are multiple matches. — Uwe, Aug 22 '17 at 10:56

Calculate number of distinct instances occurring in a given time period

1 Answers1