Joining data frames on unique id and conditional dates

Question

I have two df I want to join

The first is a df with an employee's work schedule, with corresponding dates when the schedule was effective (effective date)

>df1 = schedule and effective date
ID      Date        Wrk_Schd
0001    8/16/2002   80.00
0001    2/27/2004   40.00
0001    2/1/2006    50.00
0001    7/1/2017    36.00

The second is a df with pay period dates and actual hours worked

>df2 = pay periods and actual hours
ID      Date        Wrk_Hrs
0001    9/9/2003    32.00
0001    10/8/2005   35.00   
0001    10/21/2006  35.00
0001    12/21/2007  35.00   
0001    9/9/2012    40.00
0001    10/9/2013   40.00
0001    12/9/2017   36.00
0001    12/21/2017  36.00

How would I join on ID and date so that the df with actual hours worked matches the work schedule for the applicable effective date?

Keep in mind the dates in df1 and df2 are not exactly the same. So the solution I am looking for would join based on whether the pay period was after the effective date, on the condition that there is not another effective date that might apply.

The desired result is as follows

>df3 
ID      Date        Wrk_Hrs Wrk_Schd
0001    9/9/2003    32.00   80.00
0001    10/8/2005   35.00   40.00
0001    10/21/2006  35.00   50.00
0001    12/21/2007  35.00   50.00
0001    9/9/2012    40.00   50.00
0001    10/9/2013   40.00   50.00
0001    12/9/2017   36.00   36.00
0001    12/21/2017  36.00   36.00

Can you put the desired result in your post? It seems that `merge(df1, df2, by=c("ID","Date"))` would do it — rvbarreto, Jan 26 '18 at 19:13
@rvbarreto the issue with that approach is that the effective dates and pay period dates are not exactly the same. — samuelt, Jan 26 '18 at 19:20
Using `sqldf` package will make things easier and simpler. Do you want me to give you an example? — MKR, Jan 26 '18 at 19:24
@MKR thanks for the suggestion. I have seen that package in other examples. Is there a method you would suggest when using that package? I am having a hard time conceptualizing the approach. I am thinking I would match if effective date is after the pay period date, but how do I avoid an overwrite by the max effective date if I have multiple instances of it? — samuelt, Jan 26 '18 at 19:38
@Onyambu that will provide me with the non-matched rows but I don't see how that facilitates a correct join — samuelt, Jan 26 '18 at 19:43
I just saw you updated your post. Now it is not just based on year. I will update my answer when I have time. — www, Jan 26 '18 at 19:45
I'm having a hard time to get things together. The worker was supposed to work at 8/16/2002, but he just worked at 9/9/2003, more than a year latter? This is what I can conclude from your desired result for df3. This is important so we can know what logic should be used to join the data. — rvbarreto, Jan 26 '18 at 19:54
you might find a solution using [fuzzyjoin](https://github.com/dgrtwo/fuzzyjoin) — Eric Fail, Jan 26 '18 at 19:56
Clarifying. How do you know that row 1 at df1 should be joined with row 1 at df2? what logic do you use? — rvbarreto, Jan 26 '18 at 20:03
@rvbaretto this is just a df for representational purposes. The larger dataset has thousands of pay periods. So in between the work schedule date and effective date and the pay period dates there could have been numerous pay periods. That may or may not be important to the core problem, which is for any given pay period I need to join the work schedule for the applicable effective date. — samuelt, Jan 26 '18 at 20:07
But in a huge data set, how to you look at one row in one table and relate it to a row in the other table? What method do you use? Is there a key field, is it the position? I mean, how to you look at a date, an worker ID and worked hours and related to a table with another date, the same worker ID and scheduled hours? If you were going to do it manually, how would you do it? — rvbarreto, Jan 26 '18 at 20:09
@rvbarreto Say I am looking at df2. I see that a pay period date for worker 0001 was 12/9/17 and they worked 36h. I then look at df1 to reference what the scheduled hours were for that period. I see that worker 001's work schedule was updated on 7/1/2017, so the scheduled hours were 36h. For pay period of 12/9/17 worker 001 worked 36 of their 36 scheduled hours. — samuelt, Jan 26 '18 at 20:23
@samuelt So you want that values from df2 (e.g. 9/9/2003 32.00) with date between two consecutive periods at df1 (e.g. 8/16/2002 and 2/27/2004) to be assigned to the row at df1 with the smaller one (8/16/2002)? — rvbarreto, Jan 26 '18 at 21:06
@samuelt I have added one solution which may help you. Have a look. — MKR, Jan 27 '18 at 00:12

MKR · Accepted Answer · 2018-01-27T18:56:02.327

One possible solution could be using dplyr and sqldf.

# The data
df1 <- read.table(text = "ID      Date        Wrk_Schd
0001    08/16/2003   80.00
0001    02/27/2004   40.00
0001    02/01/2006    50.00
0001    07/01/2017    36.00", header = TRUE, stringsAsFactors = FALSE)

# Change Date column to date type
df1$Date <- as.Date(df1$Date, "%m/%d/%Y")

df2 <- read.table(text = "ID      Date        Wrk_Hrs
0001    09/09/2003    32.00
0001    10/08/2005   35.00   
0001    10/21/2006  35.00
0001    12/21/2007  35.00   
0001    09/09/2012    40.00
0001    10/09/2013   40.00
0001    12/09/2017   36.00
0001    12/21/2017  36.00", header = TRUE, stringsAsFactors = FALSE)

# Change Date column to date type 
df2$Date <- as.Date(df2$Date, "%m/%d/%Y")


library(dplyr)
library(sqldf)
# Use lead function to add a column that show previous day of the next
  schedule date
df1_Mod <- df1 %>%
  arrange(ID, Date) %>%
  group_by(ID) %>%
  mutate(End_Date = lead(Date) - 1)

df1_Mod
#  ID       Date Wrk_Schd   End_Date
#1  1 2003-08-16       80 2004-02-26
#2  1 2004-02-27       40 2006-01-31
#3  1 2006-02-01       50 2017-06-30
#4  1 2017-07-01       36       <NA>

#Join data.frames based on ID and Date between Date and End_Date

df3 <- sqldf("SELECT df2.ID, df2.Date, df2.Wrk_Hrs, df1_Mod.Wrk_Schd 
              FROM df2, df1_Mod
              WHERE df2.ID = df1_Mod.ID AND
              df2.Date >= df1_Mod.Date AND
             (df1_Mod.End_Date IS NULL OR df2.Date <= df1_Mod.End_Date)")

df3
#  ID       Date Wrk_Hrs Wrk_Schd
#1  1 2003-09-09      32       80
#2  1 2005-10-08      35       40
#3  1 2006-10-21      35       50
#4  1 2007-12-21      35       50
#5  1 2012-09-09      40       50
#6  1 2013-10-09      40       50
#7  1 2017-12-09      36       36
#8  1 2017-12-21      36       36

I tried your code for the database with more than one ID that I built in my answer and, at the line you define df1_Mod, the result was not the expected. It produced an `End_Date` lower than `Date` at the line where ID 1 ends and ID 2 start. Check it out. — rvbarreto, Jan 27 '18 at 17:08
@rvbarreto Excellent pick mate. I had forgotten to copy `group_by` clause. — MKR, Jan 27 '18 at 18:57
Elegant way to join data together at the line you did the piping. — rvbarreto, Jan 27 '18 at 19:40

rvbarreto · Answer 2 · 2018-01-27T19:30:17.513

This code will do the trick. I changed df1$Date to df1$start.date, so the code will be clearer.

library(lubridate)
df1 <- data.frame(ID=c(1, 1, 1, 1),
                  start.date=c("8/16/2003", "2/27/2004",
                               "2/1/2006", "7/1/2017"),
                  Wrk_Schd=c(80.00, 40.00, 50.00, 36.00))

df2 <- data.frame(ID=c(1, 1, 1, 1, 1, 1, 1, 1),
                  Date=c("9/9/2003", "10/8/2005",
                         "10/21/2006", "12/21/2007",
                         "9/9/2012", "10/9/2013",
                         "12/9/2017", "12/21/2017"),
                  Wrk_Hrs=c(32.00, 35.00, 35.00, 35.00,
                            40.00, 40.00, 36.00, 36.00))

df1$start.date <- as.Date(df1$start.date, "%m/%d/%Y")
df2$Date <- as.Date(df2$Date, "%m/%d/%Y")

## These lines are just to expand the data to have
## more than one ID so the code can be better tested
## They can be discarding without affecting the code
df1 <- rbind(df1,data.frame(ID=df1$ID+1, start.date=df1$start.date+1, Wrk_Schd=df1$Wrk_Schd+10))
df2 <- rbind(df2,data.frame(ID=df2$ID+1, Date=df2$Date+1, Wrk_Hrs=df2$Wrk_Hrs+1))

## order and set end of periods
df1 <- df1[order(df1$ID,df1$start.date),] #order data
df1$end.date <- c(df1$start.date[-1]-1, today())
df1$end.date[df1$ID!=c(df1$ID[-1],df1$ID[1])] <- today() #set end of periods

##  ID start.date Wrk_Schd   end.date
##1  1 2003-08-16       80 2004-02-26
##2  1 2004-02-27       40 2006-01-31
##3  1 2006-02-01       50 2017-06-30
##4  1 2017-07-01       36 2018-01-27
##5  2 2003-08-17       90 2004-02-27
##6  2 2004-02-28       50 2006-02-01
##7  2 2006-02-02       60 2017-07-01
##8  2 2017-07-02       46 2018-01-27

## Assing Wrk_Schd to each of the jobs
df2$Wrk_Schd <-apply(df2, 1, function(x)(
    df1$Wrk_Schd[which((df1$start.date<x["Date"])&
                       (df1$end.date>x["Date"])&
                       (df1$ID==x["ID"]))]
))

##   ID       Date Wrk_Hrs Wrk_Schd
##1   1 2003-09-09      32       80
##2   1 2005-10-08      35       40
##3   1 2006-10-21      35       50
##4   1 2007-12-21      35       50
##5   1 2012-09-09      40       50
##6   1 2013-10-09      40       50
##7   1 2017-12-09      36       36
##8   1 2017-12-21      36       36
##9   2 2003-09-10      33       90
##10  2 2005-10-09      36       50
##11  2 2006-10-22      36       60
##12  2 2007-12-22      36       60
##13  2 2012-09-10      41       60
##14  2 2013-10-10      41       60
##15  2 2017-12-10      37       46
##16  2 2017-12-22      37       46

I will also suggest another solution that may save a further step for you: Summing up the works which belong to a period.

df1$Total_Wrk_Hrs <- apply(df1,1,function(x)(
    sum(df2$Wrk_Hrs[which((x["start.date"]<df2$Date)&
                          (x["end.date"]>df2$Date)&
                          (x["ID"]==df2$ID))]
        )
))

##  ID start.date Wrk_Schd   end.date Total_Wrk_Hrs
##1  1 2003-08-16       80 2004-02-26         32
##2  1 2004-02-27       40 2006-01-31         35
##3  1 2006-02-01       50 2017-06-30        150
##4  1 2017-07-01       36 2018-01-27         72
##5  2 2003-08-17       90 2004-02-27         33
##6  2 2004-02-28       50 2006-02-01         36
##7  2 2006-02-02       60 2017-07-01        154
##8  2 2017-07-02       46 2018-01-27         74

Finally, I noticed you are not using Date format. You should convert it to be able compare dates. You can do it by running this code in every date column which is still not a date

df2$Date <- as.Date(df2$Date , format="%m/%d/%Y")
class(df2$Date) #check if is date
#[1] "Date"

Also, you should always provide reproducible codes. This means that by simply coping your code and running, I should be able to reproduce the error. Take a look at How to make a great R reproducible example? for more on this. By observing this, you increase (a lot) the changes of having a reply.

Good solution but you haven't considered ID while merging data. How will it work where there are for many different IDs — MKR, Jan 27 '18 at 00:37
You are right again! I fixed the code and also added another possibility for OP — rvbarreto, Jan 27 '18 at 16:31
Good work. Now it will work. My +1 for use of logic `df1$ID!=c(df1$ID[-1],df1$ID[1])`. One minor point please include `lubridate` package in your code otherwise `today()` will not work. — MKR, Jan 27 '18 at 19:11
Nice one, I fixed it. I'm not sure yet if it is better to let it as `today()` or as `NA`, as you did, but I kept it to provide Op with a different alternative than the one you did — rvbarreto, Jan 27 '18 at 19:37

Onyambu · Answer 3 · 2018-01-26T20:55:16.750

You will have to create a new column that has both the effective and period dates and use that to merge the two dfs, then you can remove the columns if you want to

 df1$Date=as.Date(df1$Date,format="%m/%d/%Y")
 df2$Date=as.Date(df2$Date,format="%m/%d/%Y")
 s=colSums(outer(df1$Date,df2$Date,`<`))
 df3=df1[s,]
 df3$d=df2$d=paste(df1$Date[s],df2$Date)
 df=merge(df2,df3[-2])
 df[order(df$Date),-2]
 ID       Date Wrk_Hrs Wrk_Schd
1  1 2003-09-09      32       80
2  1 2005-10-08      35       40
3  1 2006-10-21      35       50
4  1 2007-12-21      35       50
5  1 2012-09-09      40       50
6  1 2013-10-09      40       50
7  1 2017-12-09      36       36
8  1 2017-12-21      36       36

Joining data frames on unique id and conditional dates

3 Answers3