1

I am trying to transform data into data2. I am looking for a base R or dplyr solution. There is an ID associated with each policy. There is a start date and an end date. These are all given. A policy year starts on the start date and ends one year later. A policy may go for several years. The first part of a policy needs to have a PolYr value of 0. When the policy year goes into the next year, PolYr takes the value 1. I was able to figure that out via

Numeric sequence with condition

For each contract, there is a row for each PolYr and CaldYr combination. I also need to determine the CaldYr. Looking at ID = 103, we see that the contract starts in 2011, its first row will have PolYr = 0 and CaldYr = 2011. The second part of PolYr 0 goes into 2012 so the second row for ID = 103 will have PolYr = 1 and CaldYr = 2012. This policy is more than 2 years in length and finishes in late 2013 so it goes for five rows.

Below are before and after data frames. I did some research, but did not find anything that I perceived as corresponding to my problem.

library(dplyr)    
ID = c(101, rep(102, 2), rep(103,5))
    start = as.Date(c('2/1/2010', rep('5/17/2011', 2), rep('5/17/2011', 5)), '%m/%d/%Y')
    end = as.Date(c('3/5/2010', rep('1/4/2012', 2 ), rep('8/4/2013', 5 )), '%m/%d/%Y')
    data = data.frame(ID = ID, start = start, end = end)

    v = c(0,1)
    data = data %>% group_by(ID) %>% mutate(PolYr = rep_len(v, length(ID)))
    data

    data2 = data
    data2$CaldrYr = c(2010, 2011, 2012, 2011, 2012, 2012, 2013, 2013)
    data2
user2738483
  • 147
  • 1
  • 2
  • 11
  • I don't think you've laid out your logic sufficiently to show what needs to happen, but you could reproduce `data2` with `data %>% group_by(ID) %>% mutate(CaldrYr = lubridate::year(start) + cumsum(PolYr))`. Maybe that's what you're after? – alistaire May 10 '20 at 04:41
  • Yes. That solves my problem. Thanks. – user2738483 May 10 '20 at 04:51

2 Answers2

1

In base R, we can use ave to get cumsum by group and add the year value of start.

data$CaldrYr <- with(data, as.integer(format(start, "%Y")) + 
                            ave(PolYr, ID, FUN = cumsum))

#     ID  start      end        PolYr CaldrYr
#  <dbl> <date>     <date>     <dbl>   <dbl>
#1   101 2010-02-01 2010-03-05     0    2010
#2   102 2011-05-17 2012-01-04     0    2011
#3   102 2011-05-17 2012-01-04     1    2012
#4   103 2011-05-17 2013-08-04     0    2011
#5   103 2011-05-17 2013-08-04     1    2012
#6   103 2011-05-17 2013-08-04     0    2012
#7   103 2011-05-17 2013-08-04     1    2013
#8   103 2011-05-17 2013-08-04     0    2013
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
1

With data.table, we can do

library(data.table)
library(lubridate)
setDT(data)[, CaldrYr := year(start) + cumsum(PolYr), ID]
akrun
  • 874,273
  • 37
  • 540
  • 662
  • @user2738483 it could be either releated to the `year` extraction or `cumsum` but cumsum is faster option. Perhaps change the `year(start)` to `as.integer(substr(start, 1, 4))` – akrun May 10 '20 at 22:03
  • Thanks to both of you. I timed the three methods data.table was the fastest. Then base R. Then dplyr. I used system.time data.table: 3.20 sec base R: 5.97 dplyr: 9.24 I am working with over 2 million rows and hundreds of thousands of ID (= ContractKey) The real bottleneck turns out to be system.time(data_res2 <- data_res2 %>% group_by(ContractKey) %>% mutate(PolYr = rep_len(c(0,1), length(ContractKey)))) # user system elapsed # 924.16 673.19 1598.32 Is there a faster data.table solution? – user2738483 May 10 '20 at 22:13
  • @user2738483 ii think the `substr` method is slower compared to the 'year'. From my experiments with `8e6` data, the original version is the fastest. May be , it is the cumsum that is the issue here – akrun May 10 '20 at 22:17
  • The first method was much faster system.time(setDT(data_res2)[, CaldrYr2 := year(SALE_DATE) + cumsum(PolYr), ContractKey]) # user system elapsed # 3.17 0.06 3.20 system.time(setDT(data_res2)[, PolYrcum := cumsum(PolYr), ContractKey ][, CaldrYr2c := as.integer(substr(SALE_DATE, 1, 4)) + PolYrcum ][, PolYrcum := NULL][]) # user system elapsed # 11.27 0.33 11.53 – user2738483 May 10 '20 at 22:18