I have a data.table that looks like this
CUSTOMER_ID START_DATE END_DATE COUNTRY
1 2006-11-10 2010-10-25 USA
2 2006-11-10 2018-09-29 FRANCE
3 2017-08-23 2018-09-29 CHINA
4 2007-01-30 2018-09-29 JAPAN
5 2007-01-30 2018-09-29 COLOMBIA
6 2007-01-30 2017-01-30 BRAZIL
7 2007-01-30 2017-01-30 USA
8 2012-03-15 2018-09-29 USA
9 2012-03-15 2018-09-29 CHINA
10 2006-11-10 2015-11-06 CHINA
With these data I'd like to do the following:
Calculate the difference in years between two dates
Based on the calculated number of years, I want to create a new column "PERIOD", to display each period each customer has been active, using the start and end year in the START_DATE and END_DATE columns. The end result should be something like this:
ID PERIOD COUNTRY 1 2006-2007 USA 1 2007-2008 USA 1 2008-2009 USA 1 2009-2010 USA 2 2006-2007 FRANCE 2 2007-2008 FRANCE 2 2008-2009 FRANCE 2 2009-2010 FRANCE 2 2010-2011 FRANCE 2 2011-2012 FRANCE 2 2012-2013 FRANCE 2 2013-2014 FRANCE 2 2014-2015 FRANCE 2 2015-2016 FRANCE 2 2016-2017 FRANCE 2 2017-2018 FRANCE 3 2017-2018 CHINA 3 2018- CHINA
So far, I've been able to get to number 1 above (calculate the difference in years between the START_DATE and the END_DATE):
date_cols <- c("START_DATE", "END_DATE")
DT[, (date_cols) := lapply(.SD, anytime::anydate), .SDcols = date_cols]
DT[,YEARS_ACTIVE:= ceiling(as.numeric(END_DATE-START_DATE)/365)]
Any help is much appreciated