1

I have a data frame Data1 with three columns: NoContract,IniDate,FinDate representing the identifier of a contract, when a contract begins and when finishes respectively. On the other hand I have a period of analysis: January 1, 2012 to December 31, 2014. I want to find how many contracts where active in each month of the analysis period, by active I mean that a contract has at least one day of its dates between IniDate and FinDate in a month of analysis period.

I tried in R doing:

Lets say Data1 is:

Data1 <- data.frame(NoContract= 1:3, IniDate= as.Date(c("2011-05-03","2012-03-13","2014-03-26")),FinDate=as.Date(c("2015-01-05","2013-03-13","2015-08-19")))
Data1

  NoContract    IniDate    FinDate
1          1 2011-05-03 2015-01-05
2          2 2012-03-13 2013-03-13
3          3 2014-03-26 2015-08-19

I´ve created another data frame DatesCalc as:

DatesCalc<-data.frame(monthI=seq(as.Date("2012-01-01"), as.Date("2014-12-31"), by="1 month"), monthF=(seq(as.Date("2012-02-01"), as.Date("2015-01-01"), by="1 month")-1))
head(DatesCalc)

      monthI     monthF
1 2012-01-01 2012-01-31
2 2012-02-01 2012-02-29
3 2012-03-01 2012-03-31
4 2012-04-01 2012-04-30
5 2012-05-01 2012-05-31
6 2012-06-01 2012-06-30

Next, I wrote a function

myfun<-function(X,Y){
  d1<-numeric()
  d2<-numeric()
  for (i in 1:36){ #36 num of rows on DatesCalc
    d1<-numeric()
    for (j in 1:3){ #3 num of rows of my Data1 (my actual case near 550K rows)
      d1<-c(d1,sum(seq(X[i,1],X[i,2],by=1)%in%seq(Y[j,2],Y[j,3],by=1),na.rm=TRUE)>0)
    }
d2<-cbind(d2,d1)
  }
  return(d2)
}

So what it does is, for each row of Data1, creates a sequence of dates of each row of DatesCalcand prove if it is within the sequence of dates of the current row of Data1. This function returns a matrix where rows represent a contract and columns months from Jan 2012 to Dec 2014, and each cell has 1 if in a month the contract was active, and 0 if not (see Res). Finally I used apply to sum by column and got what i want.

Res<-myfun(DatesCalc,Data1)
Res
     d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1
[1,]  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
[2,]  0  0  1  1  1  1  1  1  1  1  1  1  1  1  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
[3,]  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1  1  1  1  1  1  1  1  1  1

apply(Res,2,sum)
d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 d1 
 1  1  2  2  2  2  2  2  2  2  2  2  2  2  2  1  1  1  1  1  1  1  1  1  1  1  2  2  2  2  2  2  2  2  2  2

The case is that I have hundreds of thousand of rows (550K) in my actual Data1, and running myfun on it is inefficient. My question is, maybe is a way doing this efficient in R? Or any suggest on how to improve my code. Thank you Comunnity.

Chriss Paul
  • 1,101
  • 6
  • 19
  • 1
    This seems like a case for `foverlaps` in the `data.table` package. You will find some very nice Q&A on SO, e.g. [here](http://stackoverflow.com/questions/27487949/how-to-perform-join-over-date-ranges-using-data-table/27488219#27488219) – Henrik Jun 09 '15 at 21:05
  • Thank you @Henrik I will try with `foverlaps` – Chriss Paul Jun 09 '15 at 21:13
  • I have a similar problem can you help me https://stackoverflow.com/questions/73171451/create-dates-in-columns-from-a-defined-period-in-r – Johan Jul 30 '22 at 00:20

1 Answers1

4

Here an option using data.table foverlaps.

  1. First , foverlaps is a merge using interval. You should have the same column names to do the merge. You should also set the keys of the second table.
  2. LThe desired output is a matrix where rows represent a contract and columns months from Jan 2012 to Dec 2014, so I cretaed a new column period which is the year-month of the contract.
  3. Reshape the result in the wide format using dcast.data.table .

The code:

library(data.table)
setDT(Data1)
setDT(DatesCalc)
setkey(Data1, IniDate, FinDate)   ## Set keys for merge 
setnames(DatesCalc,names(DatesCalc),c('IniDate','FinDate')) ## rename for merge
dcast.data.table(        ## wide format
  foverlaps(DatesCalc, Data1, type="within")[,
        period := format(i.IniDate,'%Y-%m')], ## create a new variable here
  NoContract~period,fun=length) ## the aggregate function is the length (T/F)

  NoContract 2012-01 2012-02 2012-03 2012-04 2012-05 2012-06 2012-07 2012-08 2012-09 2012-10 2012-11 2012-12 2013-01 2013-02 2013-03 2013-04 2013-05 2013-06 2013-07
1:          1       1       1       1       1       1       1       1       1       1       1       1       1       1       1       1       1       1       1       1
2:          2       0       0       0       1       1       1       1       1       1       1       1       1       1       1       0       0       0       0       0
3:          3       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0
   2013-08 2013-09 2013-10 2013-11 2013-12 2014-01 2014-02 2014-03 2014-04 2014-05 2014-06 2014-07 2014-08 2014-09 2014-10 2014-11 2014-12
1:       1       1       1       1       1       1       1       1       1       1       1       1       1       1       1       1       1
2:       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0
3:       0       0       0       0       0       0       0       0       1       1       1       1       1       1       1       1       1
agstudy
  • 119,832
  • 17
  • 199
  • 261
  • 1
    Amazing! It worked on my big table with 560K rows. Thank you so much. – Chriss Paul Jun 09 '15 at 22:02
  • actually, we came to two different answers, I think with `period := format(i.IniDate,'%Y-%m')` you are taking only the initial date for constructing the matrix. As you can see the second contract is available from `2012-03-13`, in you case: column of `2012-03`, contract `3` you got `0` and must be `1` . Nevertheless is a great approach to what i need, and now I´m trying using `foverlaps` in order to get what i want that is how many contracts are active in each month of the analysis period. – Chriss Paul Jun 10 '15 at 15:57
  • @ChrissPaul if you want to compute how many contracts for each period , no need to reshape your data in the wide format , just aggregate by period , something like this `foverlaps(DatesCalc, Data1, type="within")[,period := format(i.IniDate,'%Y-%m')][,.N,period]` will give you the right answer. – agstudy Jun 10 '15 at 16:41
  • in fact that´s what i was looking for. However is still a little bug cause, in case of `type="within"` of `foverlaps` after reading its help files, I found that only takes into account full content of intervals, so we should use `type="any"`in order to overlap intervals less than a month (i.e. cases where in a given month there were only few days that contract was active) or am I misunderstanding smthng? The case is that we get same result when: `foverlaps(DatesCalc, Data1, type="any")[,period := format(i.IniDate,'%Y-%m')][,.N,period]`. Again thank you so much for your help. – Chriss Paul Jun 10 '15 at 17:52