I have dummy data
structure(list(id = c(1, 1, 2, 3, 3, 3, 4, 5, 5, 5, 6, 7, 7,
7), policy_num = c(41551662L, 50966414L, 43077202L, 46927463L,
57130236L, 57050065L, 26196559L, 33545119L, 52304024L, 73953064L,
50340507L, 50491162L, 76577511L, 108067534L), product = c("apple",
"apple", "pear", "apple", "apple", "apple", "plum", "apple",
"pear", "apple", "apple", "apple", "pear", "pear"), start_date =
structure(c(13607, 15434, 14276, 15294, 15660, 15660, 10547, 15117, 15483,
16351, 15429, 15421, 16474, 17205), class = "Date"), end_date = structure(c(15068,
16164, 17563, 15660, 15660, 16390, 13834, 16234, 17674, 17447,
15794, 15786, 17205, 17570), class = "Date")), .Names = c("id",
"policy_num", "product", "start_date", "end_date"), row.names = c(NA,
-14L), class = c("data.table", "data.frame"), .internal.selfref = <pointer: 0x0000000000320788>)
id policy_num product start_date end_date
1 41551662 apple 2007-04-04 2011-04-04
1 50966414 apple 2012-04-04 2014-04-04
2 43077202 pear 2009-02-01 2018-02-01
3 46927463 apple 2011-11-16 2012-11-16
3 57130236 apple 2012-11-16 2012-11-16
3 57050065 apple 2012-11-16 2014-11-16
4 26196559 plum 1998-11-17 2007-11-17
5 33545119 apple 2011-05-23 2014-06-13
5 52304024 pear 2012-05-23 2018-05-23
5 73953064 apple 2014-10-08 2017-10-08
6 50340507 apple 2012-03-30 2013-03-30
7 50491162 apple 2012-03-22 2013-03-22
7 76577511 pear 2015-02-08 2017-02-08
7 108067534 pear 2017-02-08 2018-02-08
Based on it, I'd like to calculate the following variables (grouped by user_id):
1) Number of currently held product (no_prod_now
) - number of distinct products, whose end_date
> currently evaluated start_date
. Simply, number of products held by user_id
at the time of start_date
2) Number of currently held active policies (no_policies_now
) - as above, but applied to policy_num
3) Number of policies opened within 3 months prior the current start_date
(policies_open_3mo
)
4) policies_closed_3mo
- as above, but number of closed policies in the past 3 months
The desirable output would look like this:
id policy_num product start_date end_date no_prod_now no_policies_now policies_closed_3mo
1 41551662 apple 2007-04-04 2011-04-04 1 1 0
1 50966414 apple 2012-04-04 2014-04-04 1 1 0
2 43077202 pear 2009-02-01 2018-02-01 1 1 0
3 46927463 apple 2011-11-16 2012-11-16 1 1 0
3 57130236 apple 2012-11-16 2012-11-16 1 1 1
3 57050065 apple 2012-11-16 2014-11-16 1 1 2
4 26196559 plum 1998-11-17 2007-11-17 1 1 0
5 33545119 apple 2011-05-23 2014-06-13 1 1 0
5 52304024 pear 2012-05-23 2018-05-23 2 2 0
5 73953064 apple 2014-10-08 2017-10-08 2 2 0
6 50340507 apple 2012-03-30 2013-03-30 1 1 0
7 50491162 apple 2012-03-22 2013-03-22 1 1 0
7 76577511 pear 2015-02-08 2017-02-08 1 1 0
7 108067534 pear 2017-02-08 2018-02-08 1 1 1
policies_open_3mo
0
0
0
0
0
1
0
0
1
0
0
0
0
0
I'm looking for the solution implemented ideally in data.table
, as I'm going to apply it to big data volumes, but base R
or dplyr
solutions I could always convert to data.table
, o would be also valuable, thanks!