
I am working with cattle fertility data. In one table (data frame) I have a record of all the services performed on a cow (i.e. inseminations). In a different table I have the pregnancy diagnoses (positive or negative). Both have a unique ID (animal_id). My challenge has been merging both tables over the right date range, meaning I need the pregnancy check associated with the right insemination record. Here is a sample of what both tables look like:

animal_id     service_date
610710        2005-10-22
610710        2006-12-03
610710        2006-12-27
610710        2007-12-02
610710        2008-01-17
610710        2008-03-04

The other table is the same but with a different date (event_date) and the diagnosis,

 animal_id     event_date        event_description
    610710     2006-06-16           PP
    610710     2007-02-15           PP
    610710     2008-01-09           PN
    610710     2008-04-09           PP
    610710     2009-06-16           PP

So what I would like to do is merge both tables in such a way that the dates complement each other: if a service was performed on 2005-10-22, then when I join both tables that row should link to the closest date in the Events table, where by closest I also mean later, since insemination happens before diagnosis. So the desired output would be something like this:

    animal_id    service_date       event_date     event_description
 1   610710       2005-10-22              NA               NA
 2   610710           NA              2006-06-16           PP
 3   610710       2006-12-03          2007-02-15           PP
 4   610710       2006-12-27          2007-02-15           PP
 5   610710       2007-12-02          2008-01-09           PN
 6   610710       2008-01-17          2008-04-09           PP
 7   610710       2008-03-04              NA               NA  
 8   610710           NA              2009-06-16           PP 

In the final output, I would expect a large number of records not to merge against anything, like row 1 in the example output. There was a service performed in October 2005, but the first diagnosis I have for that cow is in June 2006; there are probably a number of service records missing. That is unfortunately to be expected. For this example, only rows 5 and 6 make sense. Of rows 3 and 4, I would consider only row 4, since that is probably the insemination that resulted in pregnancy.

Is that even possible in R?

Thank you!

czarniutki
  • Can you share data from events dataframe as well? Also what's the date range within which merger should happen? – Karthik S Oct 26 '20 at 15:25
  • Hi @KarthikS, thanks for your comment. I have added data from Events as well. The tricky thing about the date range is that it is "flexible" (to say the least)... anything up to 60 days after service_date would be acceptable – czarniutki Oct 26 '20 at 15:35
  • For clarity, it would help if you provided the expected output for this sample data. Thanks! (Up front, I suspect this will best be done with one of: `data.table`, SQL/`sqldf`, or `fuzzyjoin`.) – r2evans Oct 26 '20 at 15:38
  • Hi @r2evans, thanks for your comment. I have added a sample output and some extra information, hopefully that is helpful. Thanks! – czarniutki Oct 26 '20 at 16:13
  • Why isn't the 2006-06-16 event_date matched to the 2005-10-22 service_date? – G. Grothendieck Oct 26 '20 at 16:46
  • Hi @G.Grothendieck, thank you for your question. The service was performed in October 2005, but the first Diagnosis I have for that cow is in June 2006 - there are probably a number of service records missing. That is unfortunately to be expected. If there is no other way, those two records can be merged together and I would then remove that row after merging both tables, by using number of days between service_date and event_date as the criteria. If too many days, that row would be removed. – czarniutki Oct 26 '20 at 16:58

2 Answers


What you're asking for is a "non-equi" or "range" join. This isn't supported by base R's merge (or by dplyr, short of dbplyr), but it can be done with several other packages.

For all of these, I create event_date_lag so that we limit the number of matches returned for each row. (Without it, each service could match multiple events.)

fuzzyjoin

out <- fuzzyjoin::fuzzy_full_join(
  services, events,
  by = c("animal_id" = "animal_id",
         "service_date" = "event_date_lag",
         "service_date" = "event_date"),
  match_fun = list(`==`, `>=`, `<=`))
# not sure why fuzzyjoin is splitting animal_id
out <- transform(out, animal_id = ifelse(is.na(animal_id.x), animal_id.y, animal_id.x))
out$animal_id.x <- out$animal_id.y <- out$event_date_lag <- NULL
# ordering here primarily to compare with your desired output
out[with(out, order(ifelse(is.na(service_date), event_date, service_date))),]
#   service_date event_date event_description animal_id
# 6   2005-10-22       <NA>              <NA>    610710
# 7         <NA> 2006-06-16                PP    610710
# 1   2006-12-03 2007-02-15                PP    610710
# 2   2006-12-27 2007-02-15                PP    610710
# 3   2007-12-02 2008-01-09                PN    610710
# 4   2008-01-17 2008-04-09                PP    610710
# 5   2008-03-04 2008-04-09                PP    610710
# 8         <NA> 2009-06-16                PP    610710
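The OP mentioned in the comments that anything up to 60 days after service_date is acceptable, and that rows matched further apart could be removed afterwards. A minimal base R sketch of that post-merge filter, using a toy stand-in for the merged result (`out_toy` is hypothetical; substitute the real merged frame, and adjust the 60-day cap to taste):

```r
# toy stand-in for the merged frame; in practice use the join result above
out_toy <- data.frame(
  service_date = as.Date(c("2007-12-02", "2006-12-03", NA)),
  event_date   = as.Date(c("2008-01-09", "2007-02-15", "2009-06-16"))
)
# days from service to diagnosis; NA where either side is unmatched
days_between <- as.numeric(out_toy$event_date - out_toy$service_date)
# keep unmatched rows as-is, plus matches within the 60-day window
out_trimmed <- out_toy[is.na(days_between) | days_between <= 60, ]
```

Note that with a strict 60-day cap the 2006-12-03 service (74 days before its diagnosis) would be dropped, so the cap interacts with how complete your records are.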

sqldf

SQL in general supports the concept of non-equi or range joins. There's nothing special about the sqldf package here: it provides a native SQL experience (via RSQLite) without the overhead of manually uploading your data to a SQL DBMS and pulling it back down for this query. That is in fact what happens under the hood, but sqldf automates much of it, allowing one to work directly on R objects using SQL.

If by chance you are already getting your data from a DBMS, then a SQL join is by far the most efficient: get it joined at the source.

sqldf::sqldf(
  "select svc.animal_id, svc.service_date,
     ev.event_date, ev.event_description
   from services svc
     left join events ev on svc.animal_id=ev.animal_id
       and svc.service_date between ev.event_date_lag and ev.event_date
   order by svc.service_date, ev.event_date")
#   animal_id service_date event_date event_description
# 1    610710   2005-10-22       <NA>              <NA>
# 2    610710   2006-12-03 2007-02-15                PP
# 3    610710   2006-12-27 2007-02-15                PP
# 4    610710   2007-12-02 2008-01-09                PN
# 5    610710   2008-01-17 2008-04-09                PP
# 6    610710   2008-03-04 2008-04-09                PP

data.table

While I use this often, if you aren't already using it, then it might be a little more than you need (its learning curve, though worth it, can be steep).

Notes:

  • the data.table join semantics (Y[X], which is effectively "X left join Y") support inner, left, and right joins, but not full, semi, or anti-joins. While a full join might be possible via a cross-join (cartesian product), that explodes memory use and is (imo) not the best way to go.

  • the join tends to rename the left side's (the X in Y[X]) join variables to those on the right. This can be confusing, and it can in fact mask the actual pre-merge values, so I duplicate service_date to keep it separate.

  • I'm using as.data.table here just for this answer, to keep the data.frame and data.table variables distinct; it isn't required. If you're switching to data.table, setDT is the canonical way to convert in place.

  • If you choose this but do not continue with other data.table operations, then make sure you convert back to normal data.frame using setDF or as.data.frame; there are enough subtle differences that not doing this will be a problem.

library(data.table)
svcDT <- as.data.table(services)
evDT <- as.data.table(events)
evDT[svcDT[,sdate:=service_date],
     on = .(animal_id == animal_id, event_date_lag <= sdate, event_date >= sdate)
     ][, event_date_lag := NULL ]
#    animal_id event_date event_description service_date
# 1:    610710 2005-10-22              <NA>   2005-10-22
# 2:    610710 2006-12-03                PP   2006-12-03
# 3:    610710 2006-12-27                PP   2006-12-27
# 4:    610710 2007-12-02                PN   2007-12-02
# 5:    610710 2008-01-17                PP   2008-01-17
# 6:    610710 2008-03-04                PP   2008-03-04
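If you prefer the OP's fixed 60-day window over the previous-event bound, the same non-equi join can bound event_date between service_date and service_date + 60. A sketch under that assumption (the 60-day cap comes from the OP's comment, not from anything data.table-specific; `svcDT2`/`evDT2` are toy copies for illustration):

```r
library(data.table)
svcDT2 <- data.table(animal_id = 610710L,
                     service_date = as.Date(c("2007-12-02", "2006-12-03")))
evDT2  <- data.table(animal_id = 610710L,
                     event_date = as.Date(c("2008-01-09", "2007-02-15")),
                     event_description = c("PN", "PP"))
# window bounds: a diagnosis must fall within 60 days after the service
svcDT2[, `:=`(lo = service_date, hi = service_date + 60)]
evDT2[svcDT2, on = .(animal_id, event_date >= lo, event_date <= hi)]
# the 2007-12-02 service matches PN (38 days later);
# the 2006-12-03 service finds no event within 60 days, so it gets NA
```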

Data

services <- read.table(header = TRUE, text = "
animal_id     service_date
610710        2005-10-22
610710        2006-12-03
610710        2006-12-27
610710        2007-12-02
610710        2008-01-17
610710        2008-03-04")
services$service_date <- as.Date(services$service_date)

events <- read.table(header = TRUE, text = "
 animal_id     event_date        event_description
    610710     2006-06-16           PP
    610710     2007-02-15           PP
    610710     2008-01-09           PN
    610710     2008-04-09           PP
    610710     2009-06-16           PP")
events$event_date <- as.Date(events$event_date)
# previous event_date within each animal_id (first per animal is NA)
events$event_date_lag <- ave(events$event_date, events$animal_id, FUN = function(a) c(a[1][NA], a[-length(a)]))
events
#   animal_id event_date event_description event_date_lag
# 1    610710 2006-06-16                PP           <NA>
# 2    610710 2007-02-15                PP     2006-06-16
# 3    610710 2008-01-09                PN     2007-02-15
# 4    610710 2008-04-09                PP     2008-01-09
# 5    610710 2009-06-16                PP     2008-04-09
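For what it's worth, the `ave()` trick above just shifts each animal's event dates down by one; if you already use dplyr, an equivalent (and arguably more readable) way to build `event_date_lag` is a grouped `lag()`. A sketch with a toy `events2` frame standing in for `events`:

```r
library(dplyr)
events2 <- data.frame(
  animal_id  = rep(610710L, 3),
  event_date = as.Date(c("2006-06-16", "2007-02-15", "2008-01-09"))
)
events2 <- events2 %>%
  group_by(animal_id) %>%
  mutate(event_date_lag = lag(event_date)) %>%  # previous event per animal; first is NA
  ungroup()
```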
r2evans
  • Thank you so very much for the options and for your help. Unfortunately, none of them worked. The issues are very likely on my end (sorry about that, I am not an experienced R user). The fuzzyjoin option crashes with "Error: vector memory exhausted (limit reached?)" - the datasets are very large. The sqldf option fails because I cannot load the package properly (xcrun issue, also trying to work that out). data.table doesn't output to anything. It runs, doesn't complain, but it doesn't generate an output. I will update you once I fix the issues. Thanks for your help! – czarniutki Oct 28 '20 at 14:52

Using the input shown reproducibly in the Note at the end, bind the tables together with bind_rows and then sort them by date using arrange. Then define a logical column, collapse, which is TRUE if the current row has a service_date, the next row has an event_date, and they are at most 90 days apart (change 90 to whatever you want). Then group by animal_id and a group number that increases by 1 each time a service_date is encountered, and further group by row, except that a row with collapse equal to TRUE is placed in the same group as the next row so that it gets matched to that row's event_date. Finally, summarize the groups and remove the temporary columns.

Note that this approach maintains the event rows that do not have corresponding service dates and also ensures that each event date is not matched to more than one service date.

library(dplyr)

bind_rows(DF1, DF2) %>%
  arrange(coalesce(service_date, event_date)) %>%
  group_by(animal_id, group = cumsum(!is.na(service_date))) %>%
  mutate(collapse = !is.na(service_date) & !is.na(lead(event_date)) & 
       lead(event_date) - service_date <= 90) %>%
  group_by(n = 1:n() + collapse, .add = TRUE) %>%
  summarize(animal_id = first(animal_id), 
    service_date = first(service_date), 
    event_date = last(event_date), 
    event_description = last(event_description), .groups = "drop") %>%
  select(-group, -n)

giving:

# A tibble: 8 x 4
  animal_id service_date event_date event_description
      <int> <date>       <date>     <chr>            
1    610710 2005-10-22   NA         <NA>             
2    610710 NA           2006-06-16 PP               
3    610710 2006-12-03   NA         <NA>             
4    610710 2006-12-27   2007-02-15 PP               
5    610710 2007-12-02   2008-01-09 PN               
6    610710 2008-01-17   NA         <NA>             
7    610710 2008-03-04   2008-04-09 PP               
8    610710 NA           2009-06-16 PP     

sqldf

We can follow pretty much the same logic using the sqldf package:

library(sqldf)

sqldf("with b0 as 
        (select *, NULL event_date, NULL event_description from DF1 
        union 
        select animal_id, NULL service_date, event_date, event_description from DF2),
       b1 as (select *, coalesce(service_date, event_date) date1 
           from b0 order by animal_id, date1),
       b2 as (select *, lead(event_date) over () lead_event_date 
               from b1),
       b3 as (select *, coalesce(lead_event_date - service_date <= 90, 0) + 
                  row_number() over () coll 
                from b2)
      select distinct animal_id,
             group_concat(service_date) service_date, 
             group_concat(event_date) event_date, 
             group_concat(event_description) event_description 
        from b3 group by coll")

giving:

  animal_id service_date event_date event_description
1    610710   2005-10-22       <NA>              <NA>
2    610710         <NA> 2006-06-16                PP
3    610710   2006-12-03       <NA>              <NA>
4    610710   2006-12-27 2007-02-15                PP
5    610710   2007-12-02 2008-01-09                PN
6    610710   2008-01-17       <NA>              <NA>
7    610710   2008-03-04 2008-04-09                PP
8    610710         <NA> 2009-06-16                PP

Note

DF1 <- structure(list(animal_id = c(610710L, 610710L, 610710L, 610710L, 
610710L, 610710L), service_date = structure(c(13078, 13485, 13509, 
13849, 13895, 13942), class = "Date")), row.names = c(NA, -6L
), class = "data.frame")

DF2 <- structure(list(animal_id = c(610710L, 610710L, 610710L, 610710L, 
610710L), event_date = structure(c(13315, 13559, 13887, 13978, 
14411), class = "Date"), event_description = c("PP", "PP", "PN", 
"PP", "PP")), row.names = c(NA, -5L), class = "data.frame")
G. Grothendieck
  • Thank you very much for your help and comments! I couldn't try the sqldf option because I am having issues loading that package (with xcrun, trying to solve that), but the first option doesn't give me the same output as you got.. Not sure why. I don't get the dates to match; instead I get one row per service_date and one row per event_date. I am looking into it and will update you once I figure out what is wrong, just wanted to thank you for your help! – czarniutki Oct 28 '20 at 14:57
  • Copy the input in the Note at the end, paste it into R and then run the code in the body. Be aware that the dates are of Date class in the Note. We can't tell what you have since the question didn't provide the input in reproducible form. If you can't install sqldf then there is likely something wrong with your R installation. If you are using a nonstandard build without tcltk that could be a problem. – G. Grothendieck Oct 28 '20 at 15:10