How do I merge data frames (and duplicate values) by the nearest date for each individual ID?

Question

I have two data frames that I am trying to join by date (grouped by individual).

I have made example data frames of both (the real df1 is 5700 rows, and the real df2 is 287 rows).

df1 has IDs (including some not in df2), dates, and behavior values.

df2 has IDs (though fewer than df1), dates (fewer than those in df1), and hormone values.

My goal is to match the hormones for a given individual from the nearest date in df2 to the nearest date in df1 (matching as closely as possible but only duplicating the values of hormones from df2 in df1 when the nearest dates are less than or equal to 2 days apart).

I would like to have the hormones that don't match a behavioral observation printed at the bottom of the new data frame with their date such that they aren't lost (example in df3)

df1
ID    Date        behavior
a     1-12-2020   0
b     1-12-2020   1
b     1-13-2020   1
c     1-12-2020   2
d     1-12-2020   0
c     1-13-2020   1
c     1-14-2020   0
c     1-15-2020   1
c     1-16-2020   2

df2
ID    Date        hormone
a     1-10-2020   20
b     1-18-2019   70
c     1-10-2020   80
c     1-16-2020   90

#goal dataframe

df3
ID    Date        behavior hormone
a     1-12-2020   0        20
b     1-12-2020   1        NA [> 2 days from hormone]
b     1-13-2020   1        NA [> 2 days from hormone]
c     1-12-2020   2        80
d     1-12-2020   0        NA [no matching individual in df2]
c     1-13-2020   1        NA [> 2 days from hormone]
c     1-14-2020   0        90
c     1-15-2020   1        90
c     1-16-2020   2        90
b     1-18-2019   NA       70 [unmatched hormone at bottom of df3]

here is the code to create these data frames:

df1 <- data.frame(ID = c("a", "b", "b", "c", "d", "c", "c","c", "c"),
                 date = c("1-12-2020", "1-12-2020", "1-13-2020", "1-12-2020", "1-12-2020","1-13-2020","1-14-2020","1-15-2020","1-16-2020"),
behavior = c(0,1,1,2,0,1,0,1,2) )

df2 <- data.frame(ID = c("a", "b", "c", "c"),
                 date = c("1-10-2020", "1-18-2019", "1-10-2020", "1-16-2020"),
hormone = c(20,70,80,90) )

df1$date<-as.factor(df1$date)
df1$date<-strptime(df1$date,format="%m-%d-%Y")
#for nearest date function to work
df1$date<-as.Date(df1$date,"%m/%d/%y")

df2$date<-as.factor(df2$date)
df2$date<-strptime(df2$date,format="%m-%d-%Y")
#for nearest date function to work
df2$date<-as.Date(df2$date,"%m/%d/%y")

I have been able to use a function from a previous question on the forum (link and code below) to match the nearest dates and duplicate to fill, but am not able to limit the time frame of matches, or print unmatched dates in new rows. Is there a way to do this?

This is what I started working from (code below): How to match by nearest date from two data frames?

# Function to get the index specifying closest or after
Ind_closest_or_after <- function(d1, d2){
  which.min(ifelse(d1 - d2 < 0, Inf, d1 - d2))
}

# Calculate the indices
closest_or_after_ind <- map_int(.x = df1$date, .f = Ind_closest_or_after, d2 = df2$date)

# Add index columns to the data frames and join
df2 <- df2 %>% 
  mutate(ind = 1:nrow(df2))

df1 <- df1 %>% 
  mutate(ind = closest_or_after_ind)

df3<-left_join(df2, df1, by = 'ind')

This answer seems the closest but doesn't limit the values: Merge two data frames by nearest date and ID

#function to do all but limit dates and print unmatched
library(data.table)
setDT(df2)[, date := date]
df2[df1, on = .(ID, date = date), roll = 'nearest']

danlooo · Answer 1 · 2022-02-18T15:53:20.120

You can join the tables by filtering all possible combinations (cross product using expand_grid):

library(tidyverse)
library(lubridate)
#> 
#> Attaching package: 'lubridate'
#> The following objects are masked from 'package:base':
#> 
#>     date, intersect, setdiff, union
df1 <- data.frame(ID = c("a", "b", "b", "c", "d", "c", "c","c", "c"),
                  date = c("1-12-2020", "1-12-2020", "1-13-2020", "1-12-2020", "1-12-2020","1-13-2020","1-14-2020","1-15-2020","1-16-2020"),
                  behavior = c(0,1,1,2,0,1,0,1,2) )

df2 <- data.frame(ID = c("a", "b", "c", "c"),
                  date = c("1-10-2020", "1-18-2019", "1-10-2020", "1-16-2020"),
                  hormone = c(20,70,80,90) )

joined <-
  df1 %>%
  rename_all(~ paste0(., ".1")) %>%
  expand_grid(df2 %>% rename_all(~ paste0(., ".2"))) %>%
  mutate(across(starts_with("date"), ~ .x %>%  parse_date(format = "%m-%d-%Y"))) %>%
  mutate(time_diff = abs(date.1 - date.2)) %>%
  filter(time_diff <= days(2) & ID.1 == ID.2) %>%
  select(ID = ID.1, behavior = behavior.1, hormone = hormone.2)
joined
#> # A tibble: 5 x 3
#>   ID    behavior hormone
#>   <chr>    <dbl>   <dbl>
#> 1 a            0      20
#> 2 c            2      80
#> 3 c            0      90
#> 4 c            1      90
#> 5 c            2      90

df1 %>%
  left_join(joined) %>%
  full_join(df2) %>%
  as_tibble() %>%
  distinct(ID, behavior, .keep_all = TRUE) %>%
  arrange(ID, behavior)
#> Joining, by = c("ID", "behavior")
#> Joining, by = c("ID", "date", "hormone")
#> # A tibble: 9 x 4
#>   ID    date      behavior hormone
#>   <chr> <chr>        <dbl>   <dbl>
#> 1 a     1-12-2020        0      20
#> 2 a     1-10-2020       NA      20
#> 3 b     1-12-2020        1      NA
#> 4 b     1-18-2019       NA      70
#> 5 c     1-14-2020        0      90
#> 6 c     1-13-2020        1      90
#> 7 c     1-12-2020        2      80
#> 8 c     1-10-2020       NA      80
#> 9 d     1-12-2020        0      NA

^{Created on 2022-02-18 by the reprex package (v2.0.0)}

This will result in one row for each (ID, behavior) pair. You can replace this e.g. with ID, date to have only one time point at any given time point for each ID.

Ah thank you so much! this is near perfect! Thank you again. The one thing that looks wrong here is that row 5 in the final data frame shouldn't exist (its row 4 duplicated).... is there a reason it duplicated row 4 and put the hormones for a date more than two days away? — CSStat, Feb 16 '22 at 17:43
It is not duplicated. Its because (c, 1-12-2020, 2) matches (c, 1-10-2020, 80) and (c, 1-16-2020, 2) matches (c, 1-16-2020, 90). You have also two c-2 rows in your df3 — danlooo, Feb 16 '22 at 18:41
Absolutely, I see what you are saying, and thank you for taking the time to reply. The problem is that there are 2 rows of c with behavior=2 (1-12-2020 and 1-16-2020) as you said, but in the final data frame with your method, there are 4 rows of c with behavior=2. It should match up exactly like you described and should not be duplicated where 1/12 has a row of 80 and a row of 90, and 1/16 has a row of 80 and a row of 90. Instead, it should be c for 1/12 with behavior 2 has one row with hormone 80, and c for 1/16 with behavior 2 has one row with hormone 90. — CSStat, Feb 18 '22 at 15:44
Ah thank you so much! I am still a little lost as now the c=2 for 1/16 doesn't exist at all in the final version. The separate dates are distinct (even though the same amounts of behavior are observed) and need to be kept. Ideally there shouldn't be any removing or duplicating of the rows from df1. I am only trying to merge/add the info of df2 (new rows should be made only if info of df2 doesn't match like your final 2,4,and8 which are great). — CSStat, Feb 18 '22 at 20:25

How do I merge data frames (and duplicate values) by the nearest date for each individual ID?

1 Answers1