R - Join Dataframes using a key and then Approximate Dates

Question

The Problem

I am trying to merge two dataframes using 3 ID columns (Or 1 column, if I paste the 3 together), one of which is a datetime variable and can vary between the two dataframes by up to 1 second.

Background

I have two dataframes extracted from a library with transaction records. For some reason, the check-outs and the check-ins are recorded seperately, without a unique "transaction ID" to match them. I'd like to match them. The "check-out" dataframe has a record for each item that was checked-out, including the due date (when the item should be returned). The "check-in" dataframe has a record for each item that was checked-in, including the due date. Unfortunately, I am having a hard time merging these dataframes together for two reasons:

There is no unique transaction ID to match the tables. (Why? I don't know.)
The "due_date" field for each transaction can vary by up to a second for the same transaction.

The variation in due_date is occurs seemingly at random, so there isn't any way to determine for which records the two due_dates are equal or differ by 1 second. Otherwise, I could just subtract (or add) a second to make them equal.

The Data

Here is a sample of the data I am working with:

library(dplyr)
library(lubridate)

check_in <- tribble(
  ~ patron_id, ~item_id, ~checked_in, ~due_date,
    "A", "Z", "2018-04-16 07:00:00", "2018-04-16 08:00:00",
    "A", "Y", "2018-04-17 07:30:01", "2018-04-17 08:30:01",
    "B", "X", "2018-04-17 07:00:01", "2018-04-17 08:00:01",
    "B", "Z", "2018-04-17 08:00:01", "2018-04-17 09:00:01",
    "B", "Z", "2018-04-09 09:00:01", "2018-04-09 10:00:01",
    "C", "V", "2018-04-09 09:00:01", "2018-04-09 10:00:01",
    "C", "X", "2018-04-09 09:00:01", "2018-04-09 10:00:01")

check_out <- tribble(
  ~ patron_id, ~item_id, ~checked_out, ~due_date,
    "A", "Z", "2018-04-16 06:00:00", "2018-04-16 08:00:01",
    "A", "Y", "2018-04-17 06:30:01", "2018-04-17 08:30:00",
    "B", "X", "2018-04-17 06:00:01", "2018-04-17 08:00:00",
    "B", "Z", "2018-04-17 07:00:01", "2018-04-17 09:00:00",
    "B", "Z", "2018-04-09 08:00:01", "2018-04-09 10:00:01",
    "C", "V", "2018-04-09 08:00:01", "2018-04-09 10:00:01",
    "C", "X", "2018-04-09 08:00:01", "2018-04-09 10:00:00")

check_in$due_date <- ymd_hms(check_in$due_date)
check_in$checked_in <- ymd_hms(check_in$checked_in)

check_out$due_date <- ymd_hms(check_out$due_date)
check_out$checked_out <- ymd_hms(check_out$checked_out)

Patron ID is the unique ID of the person who checked out a book. The Item ID is the unique ID of the book. Checked Out is when the book was checked out. Checked In is when the book was checked in. And Due Date is when the book is due.

For this sample data, I made all of the due-dates equal to 2 hours after the check out date. I also made the check-in dates equal to 1 hour after the check out date.

Desired Output

I would like to take the "checked_in" variable from the check_in dataframe and match it to the appropriate transaction in the check_out dataframe. The output would be something like this, but perhaps with a some sort of generated transaction ID:

desired_output <- tribble(
  ~patron_id, ~item_id, ~checked_out, ~checked_in, ~due_date,
    "A", "Z", "2018-04-16 06:00:00", "2018-04-16 07:00:00", "2018-04-16 08:00:01",
    "A", "Y", "2018-04-17 06:30:01", "2018-04-17 07:30:01", "2018-04-17 08:30:00",
    "B", "X", "2018-04-17 06:00:01", "2018-04-17 07:00:01", "2018-04-17 08:00:00",
    "B", "Z", "2018-04-17 07:00:01", "2018-04-17 08:00:01", "2018-04-17 09:00:00",
    "B", "Z", "2018-04-09 08:00:01", "2018-04-09 09:00:01", "2018-04-09 10:00:01",
    "C", "V", "2018-04-09 08:00:01", "2018-04-09 09:00:01", "2018-04-09 10:00:01",
    "C", "X", "2018-04-09 08:00:01", "2018-04-09 09:00:01", "2018-04-09 10:00:00")

What I've tried

ATTEMPT #1:

I''ve tried to conditionally merge, as explained in this post, with the following modifications:

check_out <- check_out %>%
             mutate(transaction_id = paste(patron_id,"-",item_id,sep=""))
check_in <- check_in %>%
              mutate(transaction_id = paste(patron_id,"-",item_id,sep=""))

output <- merge(check_out, check_in, by="transaction_id")[abs(difftime(check_out$due_date, check_in$due_date, units = "secs"))<=1,]

But this method doesn't handle identical transaction ID's (obviously) and creates more records than there actually are.

ATTEMPT #2:

Reverting back to the original dataframes, I attempted the solution in this post, with the following modifications:

output <- cbind(check_out, check_in[ 
                  sapply(check_out$due_date, 
                    function(x) which.min(abs(difftime(x, check_in$due_date)))), ])

But this method does not consider the "transaction ID", or rather, the two key variables that I am using to create some sort of unique ID. And thus, get the output wrong.

OTHER UNSUCCESSFUL ATTEMPTS:

Fuzzy Joins as mentioned in this article. (And the other R based solutions mentioned.)
This response, which uses filtering.

Unfortunately, I wasn't able to get these to work. I wasn't confident in how the methods were working and it didn't produce what I wanted. Most likely a user error, because it seems others were able to get similar things to work.

Thanks

Thank you in advanced, if you are able to help me. I tend to use the tools provided by the Tidyverse, but I am open to using other tools and methods. I tried to make sure I did my due diligence when searching for other solutions, but if you find that I missed an important post, please mark this as duplicate and send that post my way.

Please let me know if I can provide any additional information or clarify any of the above details.

If the issue for merge on due_date is the extra minute you can take the modulo two minutes to be certain to have a match. > dueDate = as.POSIXct("2018-04-16 08:00:01", tz = "GMT") > dueDate - as.numeric(dueDate) %% 120 [1] "2018-04-16 08:00:00 GMT" — pepece, Apr 18 '18 at 18:38
Thank you for the response! While the sample data I provided tends to have clean numbers, like "2018-04-16 08:00:01", the actual data tends to be a little messier and I'm not sure if your method would work for each instance. For example, I think you would get different results for "2018-04-16 07:59:59" and "2018-04-16 08:00:00" even though they are only 1 second apart. Additionally, modulo essential rounds the number down, whereas I would like to keep the number of seconds (+/-1) accurate. Maybe I misunderstood your comment, though. Does my response seem appropriate to the method you suggest? — Christian Million, Apr 18 '18 at 19:18
Keep the idea, and then for rounding it is just a trick: mod = as.numeric(dueDate) %% 120 dueDate = dueDate - ifelse(mod<120/2, mod, mod-120) will map both :59 and :01 to :00. I let you figure out a clean implementation. — pepece, Apr 19 '18 at 05:49

score 1 · Accepted Answer · answered Apr 18 '18 at 19:49

1

Would go with this:

inner_join(check_in, check_out, by = c("patron_id", "item_id")) %>%
  filter(abs(difftime(due_date.y, due_date.x, units= "secs"))<=as.difftime(1, format = "%S", units = "secs"))

Explanation: simple join + filtering the rows with a time difference <= 1 second

answered Apr 18 '18 at 19:49

Fierr

185
2
10

Thank you. This ended up giving me the result I wanted. – Christian Million Apr 22 '18 at 23:10

R - Join Dataframes using a key and then Approximate Dates

1 Answers1

Linked