1

I am trying to achieve the 'Final.Data' output shown below.

We start with the Reference data and I want to add the 'Add.Data' but join on the 'Person' and return the most recent result prior to the reference (date).

I am looking for dplyr, data.table or sql solutions in r.

I then want to be able to reproduce this for 1000s of entries, so looking for a reasonable efficient solution.

library(tibble)
Reference.Data  <-  tibble(Person = "John",
                           Date = "2019-07-10")

Add.Data <- tibble(Person = "John",
                   Order.Date = c("2019-07-09","2019-07-08") ,
                   Order = 1:2)

Final.Data <- tibble(Person = "John",
                     Date = "2019-07-10",
                     Order.Date = "2019-07-09",
                     Order = 1)
Frank
  • 66,179
  • 8
  • 96
  • 180
JFG123
  • 577
  • 5
  • 13

5 Answers5

4

A roling join to the nearest before date should work pretty fast..

#data preparation:
# convert to data.tables, set dates as 'real' dates
DT1 <- setDT(Reference.Data)[, Date := as.IDate( Date )]
DT2 <- setDT(Add.Data)[, Order.Date := as.IDate( Order.Date )]
#set keys (this also orders the dates, convenient for the join later)
setkey(DT1, Person, Date)
setkey(DT2, Person, Order.Date)

#perform rolling update join on DT1 
DT1[ DT2, `:=`( Order.date = i.Order.Date, Order = i.Order), roll = -Inf][]

#    Person       Date Order.date Order
# 1:   John 2019-07-10 2019-07-09     1
Wimpel
  • 26,031
  • 1
  • 20
  • 37
  • 1
    **!!UPDATE**: changed rolling to "nearest" to rolling to "-Inf", since rolling to "nearest" can result in rolling to Order.Date *after* Date... -Inf only rolls to Order.Dates *before* Date !! – Wimpel Jul 10 '19 at 07:32
  • 2
    Re why sorting by date is critical for this approach: `DT1[DT2, x := i.x]` when DT1 rows are matched by multiple DT2 rows, the assignment just uses the last row (in the order of DT2). Eg, try reversing the rows' order `DT1[ DT2[2:1], on=.(Person, Date = Order.Date), `:=`( Order.date = i.Order.Date, Order = i.Order), roll = -Inf][]`. For this reason, I always do it as in sindri's answer (so behavior doesn't depend on keying/sorting). Some more reasons here if anyone's interested: https://stackoverflow.com/a/54313203 – Frank Jul 10 '19 at 16:01
  • 1
    @Frank you are right, but I assumed setting the keys before joining took care of the date-order (by person, by date, ascending).. so the rolling to -Inf always matches to the "bottom" line of hits (which contains the latest date)... In your example-code in the comment above, (I assume) you have not set keys, since you call them explicit using the `on`-argument? Or do I miss something here? If so, please let me know. – Wimpel Jul 10 '19 at 18:44
  • 2
    Nope, you didn't miss anything and are right. It matches *all* the matching rows, but applies all of them so only the bottom one matters, is my understanding. I just prefer that I can read the update join line of code on its own and understand what will happen (while doing the join as in this answer requires knowing that DT2 is sorted and that data.table has this pretty obscure behavior when there are duplicate matches). – Frank Jul 10 '19 at 20:11
  • 1
    @Frank You are right about the stand-alone readability of code. Will take this into considerarion into future code. Thanks! – Wimpel Jul 10 '19 at 20:23
3

An approach using data.table non-equi join and update by reference directly on Reference.Data:

library(data.table)
setDT(Add.Data)
setDT(Reference.Data)
setorder(Add.Data, Person, Order.Date)
Reference.Data[, (names(Add.Data)) :=
    Add.Data[.SD, on=.(Person, Order.Date<Date), mult="last",
        mget(paste0("x.", names(Add.Data)))]
]

output:

   Person       Date Order.Date Order
1:   John 2019-07-10 2019-07-09     1
chinsoon12
  • 25,005
  • 4
  • 25
  • 35
2

Another data.table solution:

setDT(Add.Data)[, Order.Date := as.Date(Order.Date)]
setDT(Reference.Data)[, Date := as.Date(Date)]

Reference.Data[, c("Order.Date", "Order") := Add.Data[.SD, 
                                                      on = .(Person, Order.Date = Date), 
                                                      roll = TRUE, 
                                                      .(x.Order.Date, x.Order)]]
Reference.Data

#    Person       Date Order.Date Order
# 1:   John 2019-07-10 2019-07-09     1
s_baldur
  • 29,441
  • 4
  • 36
  • 69
1

We can do a inner_join and then group by 'Person', slice the row with the max 'Order.Date'

library(tidyverse)
inner_join(Add.Data, Reference.Data) %>%
    group_by(Person) %>% 
    slice(which.max(as.Date(Order.Date)))
# A tibble: 1 x 4
# Groups:   Person [1]
#  Person Order.Date Order Date      
#  <chr>  <chr>      <int> <chr>     
#1 John   2019-07-09     1 2019-07-10

Or using data.tabl#

library(data.table)
setDT(Add.Data)[as.data.table(Reference.Data), on = .(Person)][, 
          .SD[which.max(as.Date(Order.Date))], by = Person]
akrun
  • 874,273
  • 37
  • 540
  • 662
  • This version does not take into account if the order date is greater than the reference date. – JFG123 Jul 10 '19 at 07:43
0

Left join the Reference.Data to the Add.Data joining on Person and on Order.Date being at or before Date. Group that by the original Reference.Data rows and take the maximum Order.Date from those. The way it works is that the Add.Data row that is used for each row of Reference.Data will be the one with the maximum Order.Date so the correct Order will be shown.

Note that dot is an SQL operator and order is an SQL keyword so we must surround names with a dot or the name order (regardless of case) with square brackets.

library(sqldf)

sqldf("select r.*, max(a.[Order.Date]) as [Order.Date], a.[Order]
  from [Reference.Data] as r
  left join [Add.Data] as a on r.Person = a.Person and a.[Order.Date] <= r.Date
  group by r.rowid")

giving:

  Person       Date Order.Date Order
1   John 2019-07-10 2019-07-09     1

I haven't checked how fast this is (adding indexes could speed it up if need be) but with only a few thousand rows efficiency is not likely as important as readability.

G. Grothendieck
  • 254,981
  • 17
  • 203
  • 341