0

I am trying to merge two dataframes based on a conditional relationship between several dates associated with unique identifiers but distributed across different observations (rows).

I have two large datasets with unique identifiers. One dataset has 'enter' and 'exit' dates (alongside some other variables).

> df1 <- data.frame(ID=c(1,1,1,2,2,3,4), 
enter.date=c('5/07/2015','7/10/2015','8/25/2017','9/1/2016','1/05/2018','5/01/2016','4/08/2017'), 
+                   exit.date = c('7/1/2015', '10/15/2015', '9/03/2017', '9/30/2016', '6/01/2019', 
'5/01/2017', '6/08/2017'));
> dcis <- grep('date$',names(df1));
> df1[dcis] <- lapply(df1[dcis],as.Date,'%m/%d/%Y');
> df1;
  ID enter.date  exit.date
1  1 2015-05-07 2015-07-01
2  1 2015-07-10 2015-10-15
3  1 2017-08-25 2017-09-03
4  2 2016-09-01 2016-09-30
5  2 2018-01-05 2019-06-01
6  3 2016-05-01 2017-05-01
7  4 2017-04-08 2017-06-08

and the other has "eval" dates.

> df2 <- data.frame(ID=c(1,2,2,3,4), eval.date=c('10/30/2015', 
'10/10/2016','9/10/2019','5/15/2018','1/19/2015'));
> df2$eval.date<-as.Date(df2$eval.date, '%m/%d/%Y')
> df2;
  ID  eval.date
1  1 2015-10-30
2  2 2016-10-10
3  2 2019-09-10
4  3 2018-05-15
5  4 2015-01-19

I am trying to calculate the average interval of time from 'exit' to 'eval' for each individual in the dataset. However, I only want those 'evals' that come after a given individual's 'exit' and before the next 'enter' for that individual (there are no 'eval' observations between enter and exit for a given individual), if such an 'eval' exists.

In other words, I'm trying to get an output that looks like this from the two dataframes above.

> df3 <- data.frame(ID=c(1,2,2,3), enter.date=c('7/10/2015','9/1/2016','1/05/2018','5/01/2016'), 
+                   exit.date = c('10/15/2015', '9/30/2016', '6/01/2019', '5/01/2017'),
+                   assess.date=c('10/30/2015', '10/10/2016', '9/10/2019', '5/15/2018'));
> dcis <- grep('date$',names(df3));
> df3[dcis] <- lapply(df3[dcis],as.Date,'%m/%d/%Y');
> df3$time.diff<-difftime(df3$exit.date, df3$assess.date)
> df3;
  ID enter.date  exit.date assess.date time.diff
1  1 2015-07-10 2015-10-15  2015-10-30  -15 days
2  2 2016-09-01 2016-09-30  2016-10-10  -10 days
3  2 2018-01-05 2019-06-01  2019-09-10 -101 days
4  3 2016-05-01 2017-05-01  2018-05-15 -379 days

Once I perform the merge finding the averages is easy enough with

> aggregate(df3[,5], list(df3$ID), mean)
  Group.1       x
1       1  -15.0 
2       2  -55.5 
3       3 -379.0

but I'm really at a loss as to how to perform the merge. I've tried to use leftjoin and fuzzyjoin to perform the merge per the advice given here and here, but I'm inexperienced at R and couldn't figure it out. I would really appreciate if someone could walk me through it - thanks!

A few other descriptive notes about the data: each ID may have some number of rows associated with it in each dataframe. df1 has enter dates which mark the beginning of a service delivery and exit dates that mark the end of a service delivery. All enters have one corresponding exit. df2 has eval dates. Eval dates can occur at any time when an individual is not receiving the service. There may be many evals between one period of service delivery and the next, or there may be no evals.

AdamB
  • 1
  • 2

1 Answers1

0

Just discovered the sqldf package. Assuming that for each ID the date ranges are in ascending order, you might use it like this:

df1 <- data.frame(ID=c(1,1,1,2,2,3,4), enter.date=c('5/07/2015','7/10/2015','8/25/2017','9/1/2016','1/05/2018','5/01/2016','4/08/2017'), exit.date = c('7/1/2015', '10/15/2015', '9/03/2017', '9/30/2016', '6/01/2019', 
'5/01/2017', '6/08/2017'));
dcis <- grep('date$',names(df1));
df1[dcis] <- lapply(df1[dcis],as.Date,'%m/%d/%Y');
df1;

df2 <- data.frame(ID=c(1,2,2,3,4), eval.date=c('10/30/2015', 
'10/10/2016','9/10/2019','5/15/2018','1/19/2015'));
df2$eval.date<-as.Date(df2$eval.date, '%m/%d/%Y')
df2;

library(sqldf)

df1 = unsplit(lapply(split(df1, df1$ID, drop=FALSE), function(df) {
    df$next.date = as.Date('2100-12-31')
    if (nrow(df) > 1)
        df$next.date[1:(nrow(df) - 1)] = df$enter.date[2:nrow(df)]
    df
}), df1$ID)


sqldf('
select df1.*, df2.*, df1."exit.date" - df2."eval.date" as "time.diff"
  from df1, df2
  where df1.ID == df2.ID
    and df2."eval.date" between df1."exit.date"
    and df1."next.date"')


  ID enter.date  exit.date  next.date ID..5  eval.date time.diff
1  1 2015-07-10 2015-10-15 2017-08-25     1 2015-10-30       -15
2  2 2016-09-01 2016-09-30 2018-01-05     2 2016-10-10       -10
3  2 2018-01-05 2019-06-01 2100-12-31     2 2019-09-10      -101
4  3 2016-05-01 2017-05-01 2100-12-31     3 2018-05-15      -379
BigFinger
  • 1,033
  • 6
  • 7