1

I have data that looks like

ID  CLM_ID  Date1   Date2
1   718182  1/1/2014    1/17/2014
1   718184  1/2/2014    1/8/2014
1   885236  1/15/2014   1/17/2014
1   885362  3/20/2014   3/21/2014
2   589963  3/18/2015   3/22/2015
2   589999  2/27/2015   5/9/2015
2   594226  4/11/2015   4/17/2015
2   689959  5/10/2015   6/10/2015
3   656696  5/1/2016    5/5/2016
3   669625  5/6/2016    5/22/2016
4   777777  2/21/2015   3/4/2015
4   778952  2/1/2015    2/28/2015
4   778965  3/1/2015    3/22/2015

I am working on two different problems with this. The first one was answered in a previous post about how to roll dates up (Date roll-up in R) and the second now is that I have intervals that are within intervals and I am trying to get rid of them. So the final outcome should look like

ID  CLM_ID  Date1   Date2
1   718182  1/1/2014    1/17/2014
1   885362  3/20/2014   3/21/2014
2   589999  2/27/2015   5/9/2015
3   656696  5/1/2016    5/22/2016
4   778952  2/1/2015    3/22/2015

Now I know I will have to create the extended intervals via the date rollup first, but then how do I get rid of these sub-intervals (a term I am making up for intervals within intervals)? I am also looking for a solution that is efficient since I actually have 75,000 records to go through (i.e. I am trying to avoid iterative solutions).

Community
  • 1
  • 1
akash87
  • 3,876
  • 3
  • 14
  • 30

3 Answers3

2

"I am also looking for a solution that is efficient ... (i.e. I am trying to avoid iterative solutions)."

"Your assumptions are your windows on the world. Scrub them off every once in a while, or the light won't come in." - Isaac Asimov

Below is a super fast base R iterative solution. It returns the correct results for very large data frames virtually instantly. (it also "rolls-up" the data, so there is no need to carry out two algorithms):

MakeDFSubInt <- function(df, includeCost = FALSE) {
    ## Sorting the data frame to allow for fast
    ## creation of the "Contained" logical vector below
    tempDF <- df[order(df$ID, df$Date1, df$Date2), ] 
    UniIDs <- unique(tempDF$ID)
    Len <- length(UniIDs)

    ## Determine starting (i.e. "s") and ending (i.e. "e")
    ## points of the respective groups of IDs
    e <- which(diff(tempDF$ID)==1)
    s <- c(1L, e + 1L)
    dfLen <- nrow(tempDF)
    e <- c(e, dfLen)

    ## Converting dates to integers so that comparison
    ## will be faster. Internally dates are stored as
    ## integers, so this isn't a problem
    dte1 <- as.integer(tempDF$Date1)
    dte2 <- as.integer(tempDF$Date2)

    ## Building logical vector in order to quickly create sub-intervals
    Contained <- rep(FALSE, dfLen)

    BegTime <- Sys.time()  ## Included to measure time of for loop execution

    for (j in 1:Len) {
        Compare <- ifelse(dte2[s[j]] >= (dte1[s[j]+1L]+1L), max(dte2[s[j]], dte2[s[j]+1L]), dte2[s[j]+1L])
        for (x in (s[j]+1L):e[j]) {
            if (!Contained[x-1L]) {
                Contained[x] <- dte2[x-1L] >= (dte1[x]-1L)
            } else {
                Contained[x] <- Compare >= (dte1[x]-1L)
            }

            ## could use ifelse, but this construct is faster
            if (Contained[x]) {  
                Compare <- max(Compare, dte2[x])
            } else {
                Compare <- dte2[x]
            }
        }
    }

    EndTime <- Sys.time()
    TotTime <- EndTime - BegTime
    if (printTime) {print(paste(c("for loop execution time was: ", format(TotTime)), collapse = ""))}

    ## identify sub-intervals
    nGrps <- which(!Contained)

    ## Create New fields for our new DF
    ID <- tempDF$ID[nGrps]
    CLM_ID <- tempDF$CLM_ID[nGrps]
    Date1 <- tempDF$Date1[nGrps]
    nGrps <- c(nGrps, dfLen+1L)

    ## as.Date is converting numbers to dates. 
    ## N.B. This only works if origin is supplied
    Date2 <- as.Date(vapply(1L:(length(nGrps) - 1L), function(x) {
                     max(dte2[nGrps[x]:(nGrps[x+1L]-1L)])}, 1L), origin = "1970-01-01")

    ## in a related question the OP had, "Cost" was
    ## included to show how the algorithm would handle
    ## generic summary information
    if (includeCost) {
        myCost <- tempDF$Cost
        Cost <-  vapply(1L:(length(nGrps) - 1L), function(x) sum(myCost[nGrps[x]:(nGrps[x+1L]-1L)]), 100.01)
        NewDf <- data.frame(ID,CLM_ID,Date1,Date2,Cost)
    } else {
        NewDf <- data.frame(ID,CLM_ID,Date1,Date2)
    }

    NewDf
}

For the example given in the question, we have:

ID <- c(rep(1,4),rep(2,4),rep(3,2),rep(4,3))
CLM_ID <- c(718182, 718184, 885236, 885362, 589963, 589999, 594226, 689959, 656696, 669625, 777777, 778952, 778965)
Date1 <- c("1/1/2014","1/2/2014","1/15/2014","3/20/2014","3/18/2015","2/27/2015","4/11/2015","5/10/2015","5/1/2016","5/6/2016","2/21/2015","2/1/2015","3/1/2015")
Date2 <- c("1/17/2014","1/8/2014","1/17/2014","3/21/2014","3/22/2015","5/9/2015","4/17/2015","6/10/2015","5/5/2016","5/22/2016","3/4/2015","2/28/2015","3/22/2015")
myDF <- data.frame(ID, CLM_ID, Date1, Date2)
myDF$Date1 <- as.Date(myDF$Date1, format = "%m/%d/%Y")
myDF$Date2 <- as.Date(myDF$Date2, format = "%m/%d/%Y")

MakeDFSubInt(myDF)
ID CLM_ID      Date1      Date2
1  1 718182 2014-01-01 2014-01-17
2  1 885362 2014-03-20 2014-03-21
3  2 589999 2015-02-27 2015-06-10
4  3 656696 2016-05-01 2016-05-22
5  4 778952 2015-02-01 2015-03-22

From a similar question the OP posted, we can add a Cost field, to show how we would proceed with calculations for this setup.

set.seed(7777)
myDF$Cost <- round(rnorm(13, 450, sd = 100),2)

MakeDFSubInt(myDF,  includeCost = TRUE)
ID   CLM_ID      Date1      Date2    Cost
1  1 718182 2014-01-01 2014-01-17 1164.66
2  1 885362 2014-03-20 2014-03-21  568.16
3  2 589999 2015-02-27 2015-06-10 2019.16
4  3 656696 2016-05-01 2016-05-22  990.14
5  4 778952 2015-02-01 2015-03-22 1578.68

This algorithm scales very well. For data frames the size the OP is looking for, returning the requested DF returns almost instantaneously and for very large data frames, it returns in just seconds.

First we build a function that will generate a random data frame with n rows.

MakeRandomDF <- function(n) {
    set.seed(109)

    CLM_Size <- ifelse(n < 10^6, 10^6, 10^(ceiling(log10(n))))
    numYears <- trunc((6/425000)*n + 5)
    StrtYear <- ifelse(numYears > 16, 2000, 2016 - numYears)
    numYears <- ifelse(numYears > 16, 16, numYears)

    IDs <- sort(sample(trunc(n/100), n, replace = TRUE))
    CLM_IDs <- sample(CLM_Size, n)
    StrtDate <- as.Date(paste(c(as.character(StrtYear),"-01-01"), collapse = ""))
    myPossibleDates <- StrtDate+(0:(numYears*365))  ## "numYears" years of data
    Date1 <- sample(myPossibleDates, n, replace = TRUE)
    Date2 <- Date1 + sample(1:100, n, replace = TRUE)
    Cost <- round(rnorm(n, 850, 100), 2)

    tempDF <- data.frame(IDs,CLM_IDs,Date1,Date2,Cost)
    tempDF$Date1 <- as.Date(tempDF$Date1, format = "%m/%d/%Y")
    tempDF$Date2 <- as.Date(tempDF$Date2, format = "%m/%d/%Y")

    tempDF
}

For moderate size DFs (i.e. 75,000 rows)

TestDF <- MakeRandomDF(75000)
system.time(test1 <- MakeDFSubInt(TestDF, includeCost = TRUE, printTime = TRUE))
[1] "for loop execution time was: 0.06500006 secs"
  user  system elapsed 
  0.14    0.00    0.14 

nrow(test1)
[1] 7618

head(test1)
  ID CLM_ID      Date1      Date2     Cost
1  1 116944 2010-01-29 2010-01-30   799.90  ## The range of dates for 
2  1 515993 2010-02-15 2011-10-12 20836.83  ## each row are disjoint
3  1 408037 2011-12-13 2013-07-21 28149.26  ## as requested by the OP
4  1  20591 2013-07-25 2014-03-11 10449.51
5  1 338609 2014-04-24 2014-07-31  4219.48
6  1 628983 2014-08-03 2014-09-11  2170.93


For very large DFs (i.e. > 500,000 rows)

TestDF2 <- MakeRandomDF(500000)
system.time(test2 <- MakeDFSubInt(TestDF2, includeCost = TRUE, printTime = TRUE))
[1] "for loop execution time was: 0.3679998 secs"
  user  system elapsed 
  1.19    0.03    1.21 

nrow(test2)
[1] 154839

head(test2)
  ID CLM_ID      Date1      Date2    Cost
1  1  71251 2004-04-19 2004-06-29 2715.69  ## The range of dates for 
2  1 601676 2004-07-05 2004-09-23 2675.04  ## each row are disjoint
3  1 794409 2004-12-28 2005-04-05 1760.63  ## as requested by the OP
4  1 424671 2005-06-03 2005-08-20 1973.67
5  1 390353 2005-09-16 2005-11-06  785.81
6  1 496611 2005-11-21 2005-11-24  904.09

system.time(test3 <- MakeDFSubInt(TestDF3, includeCost = TRUE, printTime = TRUE))
[1] "for loop execution time was: 0.6930001 secs"
  user  system elapsed 
  2.68    0.08    2.79      ## 1 million rows in under 3 seconds!!!

nrow(test3)
[1] 413668


Explanation

The main part of the algorithm is generating the Contained logical vector that is used to determine the sub-intervals of continuous dates. Generation of this vector relies on the fact that the data frame is sorted, first by ID, second by Date1, and finally by Date2. We begin by locating the starting and ending rows of each group of IDs. For example, with the example provided by the OP we have:

myDF
   ID CLM_ID      Date1      Date2
1   1 718182 2014-01-01 2014-01-17    ## <- 1   s[1]
2   1 718184 2014-01-02 2014-01-08
3   1 885236 2014-01-15 2014-01-17
4   1 885362 2014-03-20 2014-03-21    ## <- 4   e[1]
5   2 589963 2015-03-18 2015-03-22    ## <- 5   s[2]
6   2 589999 2015-02-27 2015-05-09
7   2 594226 2015-04-11 2015-04-17
8   2 689959 2015-05-10 2015-06-10    ## <- 8   e[2]
9   3 656696 2016-05-01 2016-05-05    ## <- 9   s[3]
10  3 669625 2016-05-06 2016-05-22    ## <- 10  e[3]
11  4 777777 2015-02-21 2015-03-04    ## <- 11  s[4]
12  4 778952 2015-02-01 2015-02-28
13  4 778965 2015-03-01 2015-03-22    ## <- 13  e[4]

Below is the code that generates s and e.

## Determine starting (i.e. "s") and ending (i.e. "e")
## points of the respective groups of IDs
e <- which(diff(tempDF$ID)==1)
s <- c(1L, e + 1L)
dfLen <- nrow(tempDF)
e <- c(e, dfLen)

s
1  5  9   11

e
4  8  10  13

Now, we loop over each group and begin populating the logical vector Contained. If the date range for a particular row overlaps (or is a continuance of) the date range above it, we set that particular index of Contained to TRUE. This is why the first row in each group is set to FALSE since there is nothing above to compare it to. As we are doing this, we are updating the largest date to compare against moving forward, hence the Compare variable. It should be noted that it isn't necessarily true that Date2[n] < Date2[n+1L], this is why Compare <- max(Compare, dte2[x]) for a succession of TRUEs. The result for our example is give below.

   ID CLM_ID      Date1      Date2 Contained
1   1 718182 2014-01-01 2014-01-17     FALSE
2   1 718184 2014-01-02 2014-01-08      TRUE  ## These two rows are contained
3   1 885236 2014-01-15 2014-01-17      TRUE  ## in the date range 1/1 - 1/17
4   1 885362 2014-03-20 2014-03-21     FALSE  ## This row isn't
6   2 589999 2015-02-27 2015-05-09     FALSE
5   2 589963 2015-03-18 2015-03-22      TRUE
7   2 594226 2015-04-11 2015-04-17      TRUE
8   2 689959 2015-05-10 2015-06-10      TRUE  ## N.B. 5/10 is a continuance of 5/09
9   3 656696 2016-05-01 2016-05-05     FALSE
10  3 669625 2016-05-06 2016-05-22      TRUE
12  4 778952 2015-02-01 2015-02-28     FALSE
11  4 777777 2015-02-21 2015-03-04      TRUE
13  4 778965 2015-03-01 2015-03-22      TRUE

Now we can easily identify the "starting" rows by identifying all rows with a corresponding FALSE. After this, finding summary information is a breeze by simply calculating whatever you are interested in (e.g. max(Date2), sum(Cost)) over each succession of TRUEs and Voila!!

Community
  • 1
  • 1
Joseph Wood
  • 7,077
  • 2
  • 30
  • 65
  • Thanks for this explanation. Besides `system.time` is there a way while programming to know the efficiency of a loop? – akash87 Jul 01 '16 at 18:19
  • @akash87, There is a lot of misunderstanding about `for loops` in R. Yes, they can be slow, but as Richie Cotton points out [here](http://stackoverflow.com/a/6466415/4408538) (check out the accepted answer as well on that page from Gavin Simpson), if you have a non-trivial task (like we do above), `for` loops are the way to go (if implemented properly). I highly recommend reading [this post](http://stackoverflow.com/a/2276001/4408538), and [this post](http://stackoverflow.com/q/28983292/4408538) to get a better understanding of R’s looping constructs. – Joseph Wood Jul 01 '16 at 18:48
  • @akash87, I have modified my algorithm to return the time it takes to execute the `for loop`. – Joseph Wood Jul 01 '16 at 18:48
  • @akash87, if you really wanted to know the efficiency of a particular part of your code, you could make that part a function, and use the [R profiler](https://stat.ethz.ch/R-manual/R-devel/library/utils/html/Rprof.html). – Joseph Wood Jul 01 '16 at 21:40
2

Using non-equi joins from the current development version of data.table, v1.9.7,

require(data.table) # v1.9.7+
dt[dt, .(CLM_IDs = CLM_IDs[.N==1L]), on=.(ID, Date1<=Date1, Date2>=Date2), by=.EACHI]
#    ID      Date1      Date2 CLM_ID
# 1:  1 2014-01-01 2014-01-17 718182
# 2:  1 2014-03-20 2014-03-21 885362
# 3:  2 2015-02-27 2015-05-09 589999
# 4:  2 2015-05-10 2015-06-10 689959
# 5:  3 2016-05-01 2016-05-05 656696
# 6:  3 2016-05-06 2016-05-22 669625
# 7:  4 2015-02-21 2015-03-04 777777
# 8:  4 2015-02-01 2015-02-28 778952
# 9:  4 2015-03-01 2015-03-22 778965

What this does is, for each row in dt (the one inside of square bracket), it looks up which rows match in dt (on the outside) based on the condition provided to the on argument.

The matching row indices are returned iff the only match is a self-match (since the condition includes equality as well). This is done by CLM_IDs[.N == 1L], where .N holds the number of observations for each group.

Arun
  • 116,683
  • 26
  • 284
  • 387
  • Is there any way to install v1.9.7 without first installing `Rtools`. I ask because, I need administrator rights to install `Rtools`, which currently makes installing v1.9.7 impossible. If not, is there a different way of doing the above with v1.9.6? Thanks. – Joseph Wood Jun 30 '16 at 17:38
  • @JosephWood, unfortunately not yet :-(.. we plan to make windows binaries directly available at some point.. I'll ask Jan (another data.table contributor) about this.. as he's been working a lot on these things. – Arun Jul 01 '16 at 18:41
  • 1
    @JosephWood it is on TODO list, generally waiting for [r-appveyor#29](https://github.com/krlmlr/r-appveyor/issues/29). You could grab windows binaries from [appveyor build artifacts](https://ci.appveyor.com/project/Rdatatable/data-table/history) and install it from zip. Unfortunately due to [r-appveyor#69](https://github.com/krlmlr/r-appveyor/issues/69) binaries are 7 days old now, so doesn't include non-equi join and `.EACHI` yet. – jangorecki Jul 01 '16 at 18:51
  • @jangorecki, any idea when the new version will be completed and uploaded to CRAN? – Joseph Wood Jul 01 '16 at 20:04
  • @JosephWood unfortunately no – jangorecki Jul 01 '16 at 20:23
  • 1
    @jangorecki, that's cool. As an aside, I would like to extend my appreciation for all of the awesome work your team does with the `data.table` package. It is truly amazing!! – Joseph Wood Jul 01 '16 at 20:26
  • @Arun It has been a while since I visited this page and I have a couple of questions. First, why does `CLM_IDs` have an "s" at the end? Is this a typo? Should I `setkey` before I do this? Also, what is non-equi? – akash87 Aug 26 '16 at 15:13
  • Yes, it's a typo. No you don't need to. Read about `on` from `?data.table`. [Google for nonequi joins](https://www.google.be/search?q=nonequi+joins&oq=nonequi+joins&aqs=chrome..69i57j0l5.2503j0j7&sourceid=chrome&ie=UTF-8). – Arun Aug 26 '16 at 17:22
1

Here is a not-so-pretty solution comparing each row with the dates of all other rows. I corrected the one year 3015 to 2015. The results are different from what you are expecting, though. Either I misunderstood your question, or you misread the data.

Data:

dta <- structure(list(ID = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 4L, 4L, 4L), 
                      CLM_ID = c(718182L, 718184L, 885236L, 885362L, 589963L, 589999L, 594226L, 689959L, 656696L, 669625L, 777777L, 778952L, 778965L), 
                      Date1 = structure(c(1L, 3L, 2L, 9L, 8L, 6L, 10L, 12L, 11L, 13L, 5L, 4L, 7L), .Label = c("1/1/2014", "1/15/2014", "1/2/2014", "2/1/2015", "2/21/2015", "2/27/2015", "3/1/2015", "3/18/2015", "3/20/2014", "4/11/2015", "5/1/2016", "5/10/2015", "5/6/2016"), class = "factor"), 
                      Date2 = structure(c(1L, 2L, 1L, 4L, 5L, 10L, 7L, 11L, 9L, 8L, 6L, 3L, 5L), .Label = c("1/17/2014", "1/8/2014", "2/28/2015", "3/21/2014", "3/22/2015", "3/4/2015", "4/17/2015", "5/22/2016", "5/5/2016", "5/9/2015", "6/10/2015"), class = "factor")), 
                 .Names = c("ID", "CLM_ID", "Date1", "Date2"), class = "data.frame", 
                 row.names = c(NA, -13L))

Code:

dta$Date1 <- as.Date(dta$Date1, format = "%m/%d/%Y")
dta$Date2 <- as.Date(dta$Date2, format = "%m/%d/%Y")

# Boolean vector to memorize results
keep <- logical(length = nrow(dta))
for(i in 1:nrow(dta)) {
  match <- dta[dta$Date1 <= dta$Date1[i] & dta$Date2 >= dta$Date2[i], ]
  if(nrow(match) == 1) keep[i] <- TRUE
}

# Result
dta[keep, ]
sebastianmm
  • 1,148
  • 1
  • 8
  • 26