The Problem
I am trying to merge two dataframes using 3 ID columns (Or 1 column, if I paste the 3 together), one of which is a datetime variable and can vary between the two dataframes by up to 1 second.
Background
I have two dataframes extracted from a library with transaction records. For some reason, the check-outs and the check-ins are recorded seperately, without a unique "transaction ID" to match them. I'd like to match them. The "check-out" dataframe has a record for each item that was checked-out, including the due date (when the item should be returned). The "check-in" dataframe has a record for each item that was checked-in, including the due date. Unfortunately, I am having a hard time merging these dataframes together for two reasons:
- There is no unique transaction ID to match the tables. (Why? I don't know.)
- The "due_date" field for each transaction can vary by up to a second for the same transaction.
The variation in due_date is occurs seemingly at random, so there isn't any way to determine for which records the two due_dates are equal or differ by 1 second. Otherwise, I could just subtract (or add) a second to make them equal.
The Data
Here is a sample of the data I am working with:
library(dplyr)
library(lubridate)
check_in <- tribble(
~ patron_id, ~item_id, ~checked_in, ~due_date,
"A", "Z", "2018-04-16 07:00:00", "2018-04-16 08:00:00",
"A", "Y", "2018-04-17 07:30:01", "2018-04-17 08:30:01",
"B", "X", "2018-04-17 07:00:01", "2018-04-17 08:00:01",
"B", "Z", "2018-04-17 08:00:01", "2018-04-17 09:00:01",
"B", "Z", "2018-04-09 09:00:01", "2018-04-09 10:00:01",
"C", "V", "2018-04-09 09:00:01", "2018-04-09 10:00:01",
"C", "X", "2018-04-09 09:00:01", "2018-04-09 10:00:01")
check_out <- tribble(
~ patron_id, ~item_id, ~checked_out, ~due_date,
"A", "Z", "2018-04-16 06:00:00", "2018-04-16 08:00:01",
"A", "Y", "2018-04-17 06:30:01", "2018-04-17 08:30:00",
"B", "X", "2018-04-17 06:00:01", "2018-04-17 08:00:00",
"B", "Z", "2018-04-17 07:00:01", "2018-04-17 09:00:00",
"B", "Z", "2018-04-09 08:00:01", "2018-04-09 10:00:01",
"C", "V", "2018-04-09 08:00:01", "2018-04-09 10:00:01",
"C", "X", "2018-04-09 08:00:01", "2018-04-09 10:00:00")
check_in$due_date <- ymd_hms(check_in$due_date)
check_in$checked_in <- ymd_hms(check_in$checked_in)
check_out$due_date <- ymd_hms(check_out$due_date)
check_out$checked_out <- ymd_hms(check_out$checked_out)
Patron ID is the unique ID of the person who checked out a book. The Item ID is the unique ID of the book. Checked Out is when the book was checked out. Checked In is when the book was checked in. And Due Date is when the book is due.
For this sample data, I made all of the due-dates equal to 2 hours after the check out date. I also made the check-in dates equal to 1 hour after the check out date.
Desired Output
I would like to take the "checked_in" variable from the check_in dataframe and match it to the appropriate transaction in the check_out dataframe. The output would be something like this, but perhaps with a some sort of generated transaction ID:
desired_output <- tribble(
~patron_id, ~item_id, ~checked_out, ~checked_in, ~due_date,
"A", "Z", "2018-04-16 06:00:00", "2018-04-16 07:00:00", "2018-04-16 08:00:01",
"A", "Y", "2018-04-17 06:30:01", "2018-04-17 07:30:01", "2018-04-17 08:30:00",
"B", "X", "2018-04-17 06:00:01", "2018-04-17 07:00:01", "2018-04-17 08:00:00",
"B", "Z", "2018-04-17 07:00:01", "2018-04-17 08:00:01", "2018-04-17 09:00:00",
"B", "Z", "2018-04-09 08:00:01", "2018-04-09 09:00:01", "2018-04-09 10:00:01",
"C", "V", "2018-04-09 08:00:01", "2018-04-09 09:00:01", "2018-04-09 10:00:01",
"C", "X", "2018-04-09 08:00:01", "2018-04-09 09:00:01", "2018-04-09 10:00:00")
What I've tried
ATTEMPT #1:
I''ve tried to conditionally merge, as explained in this post, with the following modifications:
check_out <- check_out %>%
mutate(transaction_id = paste(patron_id,"-",item_id,sep=""))
check_in <- check_in %>%
mutate(transaction_id = paste(patron_id,"-",item_id,sep=""))
output <- merge(check_out, check_in, by="transaction_id")[abs(difftime(check_out$due_date, check_in$due_date, units = "secs"))<=1,]
But this method doesn't handle identical transaction ID's (obviously) and creates more records than there actually are.
ATTEMPT #2:
Reverting back to the original dataframes, I attempted the solution in this post, with the following modifications:
output <- cbind(check_out, check_in[
sapply(check_out$due_date,
function(x) which.min(abs(difftime(x, check_in$due_date)))), ])
But this method does not consider the "transaction ID", or rather, the two key variables that I am using to create some sort of unique ID. And thus, get the output wrong.
OTHER UNSUCCESSFUL ATTEMPTS:
- Fuzzy Joins as mentioned in this article. (And the other R based solutions mentioned.)
- This response, which uses filtering.
Unfortunately, I wasn't able to get these to work. I wasn't confident in how the methods were working and it didn't produce what I wanted. Most likely a user error, because it seems others were able to get similar things to work.
Thanks
Thank you in advanced, if you are able to help me. I tend to use the tools provided by the Tidyverse, but I am open to using other tools and methods. I tried to make sure I did my due diligence when searching for other solutions, but if you find that I missed an important post, please mark this as duplicate and send that post my way.
Please let me know if I can provide any additional information or clarify any of the above details.