Merging two Data frames with fuzzy merge/sqldf

Question

I have the following dataframes (df11 and df22) I'd like to do a merge/full join on with "UserID=UserID" and date difference <= 30 . So if the UserIDs match up AND the date's are less than or equal to 30, I'd like them merged into one singular row. I've looked at fuzzy join here and sqldf here but I can't figure out how to implement either of those for my data frames.

df1 <- structure(list(UserID = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 1L), 
                      Full.Name = c( "John Smith", "Jack Peters", "Bob Brown", "Jane Doe", "Jackie Jane", "Sarah Brown", "Chloe Brown", "John Smith" ), 
                      Info = c("yes", "no", "yes", "yes", "yes", "yes", "no", "yes"), 
                      EncounterID = c(13L, 14L, 15L, 16L, 17L, 18L, 19L, 13L), DateTime = c("1/2/21 00:00", "1/5/21 12:00", "1/1/21 1:31", "1/5/21 3:34", "5/9/21 5:33", "5/8/21 3:39", "12/12/21 2:30", "12/11/21 9:21"), 
                      Temp = c("100", "103", "104", "103", "101", "102", "103", "105"), 
 
                      misc = c("(null)", "no", "(null)", "(null)", "(null)","(null)", "(null)", "(null)" 
                                    )), 
                 class = "data.frame", row.names = c(NA, 
                                                     -8L))

df2 <- structure(list(UserID = c(1L, 2L, 3L, 4L, 5L, 6L), 
                      Full.Name = c("John Smith", "Jack Peters", "Bob Brown", "Jane Doe", "Jackie Jane", "Sarah Brown"), 
                      DOB = c("1/1/90", "1/10/90", "1/2/90", "2/20/80", "2/2/80", "12/2/80"), 
                      EncounterID = c(13L, 14L, 15L, 16L, 17L, 18L), EncounterDate = c("1/1/21", "1/2/21", "1/1/21", "1/6/21", "5/7/21", "5/8/21"), 
                      Type = c("Intro", "Intro", "Intro", "Intro", "Care", "Out"), 
                      responses = c("(null)", "no", 
                                    "yes", "no", "no", "unsat")), 
                      
                 class = "data.frame", row.names = c(NA, 
                                                     -6L))
loadedNamespaces()
install.packages("Rcpp")
library(dplyr)
library(tidyr)
install.packages("lubridate")
library(lubridate)

df11 <- 
df1 %>% 
  separate(DateTime, c("Date", "Time"), sep=" ") %>% 
  mutate(Date = as_datetime(mdy(Date))) %>% 
  select(-Time) %>% 
  as_tibble()

df22 <-
df2 %>% 
  mutate(across(c(EncounterDate), mdy)) %>% 
  mutate(across(c(EncounterDate), as_datetime)) %>% 
  as_tibble()

@r2evans After running the first set of code, I get the following output. Which is slightly different from yours.

df11 <- mutate(df11, Date_m30 = Date %m-% days(30), Date_p30 = Date %m+% days(30))
df11
# A tibble: 8 x 7
  UserID Full.Name   Info  EncounterID Date                Temp  misc  
   <int> <chr>       <chr>       <int> <dttm>              <chr> <chr> 
1      1 John Smith  yes            13 2021-01-02 00:00:00 100   (null)
2      2 Jack Peters no             14 2021-01-05 00:00:00 103   no    
3      3 Bob Brown   yes            15 2021-01-01 00:00:00 104   (null)
4      4 Jane Doe    yes            16 2021-01-05 00:00:00 103   (null)
5      5 Jackie Jane yes            17 2021-05-09 00:00:00 101   (null)
6      6 Sarah Brown yes            18 2021-05-08 00:00:00 102   (null)
7      7 Chloe Brown no             19 2021-12-12 00:00:00 103   (null)
8      1 John Smith  yes            13 2021-12-11 00:00:00 105   (null)

It's great that you provided the output from `dput`, thank you! It's an uncommon perk in questions. However ... the conversion of columns to `Date`-class or `POSIXct`-class objects should have nothing to do with your merging issues. Please *minimize* your question, showing only code that is relevant. (That is, update the columns and redo the `dput` portion. Don't include `loadedNamespaces` or `install.packages`. Keep the `library` calls.) Thanks! — r2evans, Nov 05 '21 at 17:51
There is something really broken if you `mutate` to add `Date_m30` and it does not appear in the output. — r2evans, Nov 05 '21 at 19:28
I've tried it on 3 machines and I've gotten # A tibble: 8 x 7 on each one of them. Is there another way of doing this? — Ash S, Nov 05 '21 at 21:39
I have no idea why `mutate` would not return the additional columns. Try `df11$Date %m-% days(30)` and see if it returns what you expect. If yes, check that the `+`-variant also works. If yes, then check for typos in the name of the stored object, perhaps `df11` or `dfl1` (lower-`L` versus one-`1`), or something else that could possibly be causing this dilemma. — r2evans, Nov 05 '21 at 21:42
I re ran all the packages and libraries and it worked. You're amazing. Thank you so much. — Ash S, Nov 07 '21 at 03:28

score 1 · Accepted Answer · answered Nov 05 '21 at 17:59

One way is to first create "+/- 30 day" columns in one of them, then do a standard date-range join. Using sqldf:

Prep:

library(dplyr)
df11 <- mutate(df11, Date_m30 = Date %m-% days(30), Date_p30 = Date %m+% days(30))
df11
# # A tibble: 8 x 9
#   UserID Full.Name   Info  EncounterID Date                Temp  misc   Date_m30            Date_p30           
#    <int> <chr>       <chr>       <int> <dttm>              <chr> <chr>  <dttm>              <dttm>             
# 1      1 John Smith  yes            13 2021-01-02 00:00:00 100   (null) 2020-12-03 00:00:00 2021-02-01 00:00:00
# 2      2 Jack Peters no             14 2021-01-05 00:00:00 103   no     2020-12-06 00:00:00 2021-02-04 00:00:00
# 3      3 Bob Brown   yes            15 2021-01-01 00:00:00 104   (null) 2020-12-02 00:00:00 2021-01-31 00:00:00
# 4      4 Jane Doe    yes            16 2021-01-05 00:00:00 103   (null) 2020-12-06 00:00:00 2021-02-04 00:00:00
# 5      5 Jackie Jane yes            17 2021-05-09 00:00:00 101   (null) 2021-04-09 00:00:00 2021-06-08 00:00:00
# 6      6 Sarah Brown yes            18 2021-05-08 00:00:00 102   (null) 2021-04-08 00:00:00 2021-06-07 00:00:00
# 7      7 Chloe Brown no             19 2021-12-12 00:00:00 103   (null) 2021-11-12 00:00:00 2022-01-11 00:00:00
# 8      1 John Smith  yes            13 2021-12-11 00:00:00 105   (null) 2021-11-11 00:00:00 2022-01-10 00:00:00

The join:

sqldf::sqldf("
    select df11.*, df22.DOB, df22.EncounterDate, df22.Type, df22.responses
    from df11
      left join df22 on df11.UserID = df22.UserID
        and df22.EncounterDate between df11.Date_m30 and df11.Date_p30") %>%
  select(-Date_m30, -Date_p30)
#   UserID   Full.Name Info EncounterID                Date Temp   misc     DOB       EncounterDate  Type responses
# 1      1  John Smith  yes          13 2021-01-01 19:00:00  100 (null)  1/1/90 2020-12-31 19:00:00 Intro    (null)
# 2      2 Jack Peters   no          14 2021-01-04 19:00:00  103     no 1/10/90 2021-01-01 19:00:00 Intro        no
# 3      3   Bob Brown  yes          15 2020-12-31 19:00:00  104 (null)  1/2/90 2020-12-31 19:00:00 Intro       yes
# 4      4    Jane Doe  yes          16 2021-01-04 19:00:00  103 (null) 2/20/80 2021-01-05 19:00:00 Intro        no
# 5      5 Jackie Jane  yes          17 2021-05-08 20:00:00  101 (null)  2/2/80 2021-05-06 20:00:00  Care        no
# 6      6 Sarah Brown  yes          18 2021-05-07 20:00:00  102 (null) 12/2/80 2021-05-07 20:00:00   Out     unsat
# 7      7 Chloe Brown   no          19 2021-12-11 19:00:00  103 (null)    <NA>                <NA>  <NA>      <NA>
# 8      1  John Smith  yes          13 2021-12-10 19:00:00  105 (null)    <NA>                <NA>  <NA>      <NA>

Thank you for your help. I edited my post in the end to show my output with your first set of code. It won't work. My tibble is 8 x 7 whereas yours is 8 x 9. Do you know why? — Ash S, Nov 05 '21 at 19:18

Merging two Data frames with fuzzy merge/sqldf

1 Answers1