
I have two data frames, ord with 15k rows

         uniq.id prev.ord.dt cur.ord.dt
           <chr>      <date>     <date>
1  4892267119791  2016-04-28 2016-06-09
2  7174853145105  2016-04-26 2016-05-10
3  9600318129252  2016-07-07 2016-07-11
4  7150993150290  2016-03-30 2016-04-13
5  3121040102603  2016-05-01 2016-05-18
6  4899102101891  2016-04-29 2016-05-08
7  7174853112259  2016-05-10 2016-05-24
8  4648283132540  2016-04-25 2016-05-09
9  9600318144830  2016-05-12 2016-05-19
10 9600318134838  2016-05-23 2016-06-06
...

and dlvrd with 39k rows

         uniq.id   dlvrd.dt dlvrd.qty
           <chr>     <date>     <int>
1  9600318114229 2016-02-24        10
2  2594775116151 2016-04-07        22
3  4935357145929 2016-05-26         6
4  4127487134929 2016-05-18         9
5  4935357144169 2016-05-10        62
6  1180975135573 2016-03-16        11
7  3121040102604 2016-06-17        22
8  1580171115586 2016-01-12       240
9  3011291147292 2016-03-25        12
10 4892267115584 2016-05-16        21
...

EDIT: There are about 2k distinct values of uniq.id, and each one has multiple delivery periods under it (assume for simplicity that a delivery period is indicated by cur.ord.dt). I need to aggregate the total quantity delivered for each uniq.id within each delivery period, but the method I'm currently using takes ~35 minutes to run. I defined a custom function

add.dlvrd <- function(uniq, prev, cur) {
  require(dplyr)
  require(magrittr)  # needed for the %$% exposition pipe

  # sum the deliveries for one uniq.id that fall in [prev, cur)
  dlvrd <- get("dlvrd", envir = .GlobalEnv)
  dlvrd %>%
    filter(uniq.id == uniq,
           dlvrd.dt >= prev,
           dlvrd.dt < cur) %$%
    sum(dlvrd.qty)
}

and then run mapply

ord$dlvrd.qty <- ord %$% mapply(add.dlvrd, uniq.id, prev.ord.dt, cur.ord.dt)

Is there a more elegant way of doing this?

Side note: I'm aware that it's bad form to use "." in variable and function names instead of "_", but I don't have time to change it right now.

Jonathan H
  • Could you make a small reproducible example? The data.table package has some tools for "interval joins", e.g., here: http://stackoverflow.com/questions/27487949/how-to-perform-join-over-date-ranges-using-data-table (though that link is somewhat old) – Frank Aug 04 '16 at 18:31
  • @Frank The foverlaps function might work for me, but there may be multiple entries found in the "y", and I would want to sum these values instead of simply returning them – Jonathan H Aug 04 '16 at 18:45
  • Yeah, I think it doesn't quite fit. The latest tool for interval joins does allow for summing, though: http://stackoverflow.com/a/38394201/ (found by following links in my last link). You'll probably need to learn some data.table syntax before using it, though (a sketch of that approach appears after these comments). – Frank Aug 04 '16 at 18:47
  • Also, what is the quickest way to make a reproducible example? – Jonathan H Aug 04 '16 at 18:47
  • The usual advice for making a good example is here: http://stackoverflow.com/a/28481250/ – Frank Aug 04 '16 at 18:48
  • Concerning your "side note" at the end: There are different opinions about the use of dots in variable and function names. The snake case is not generally preferred, and using dots is not generally considered bad style. To the contrary, some style guides expressly suggest using names the way you did and consider the snake case bad style. A topic of heated and mostly unnecessary debates... – RHertel Aug 04 '16 at 18:53
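
Following up on the interval-join suggestion in the comments, a minimal data.table sketch of a non-equi join with per-row summing (assuming data.table >= 1.9.8 for non-equi joins, and that a period with no deliveries should count as zero) might look like:

library(data.table)
setDT(ord); setDT(dlvrd)

# For each row of ord, find the dlvrd rows with the same uniq.id and a dlvrd.dt
# falling in [prev.ord.dt, cur.ord.dt), then sum the matched quantities per ord row
agg <- dlvrd[ord,
             on = .(uniq.id, dlvrd.dt >= prev.ord.dt, dlvrd.dt < cur.ord.dt),
             sum(dlvrd.qty, na.rm = TRUE),
             by = .EACHI]

ord[, dlvrd.qty := agg$V1]  # agg has one row per row of ord, in the same order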

1 Answer


In SQL speak, you need a correlated aggregate subquery, which would look like the following (and should be viable via the sqldf package; column names are written without the dots, since SQL reserves "." for table qualification):

SELECT ord.uniqid, ord.prevorddt, ord.curorddt,
       (SELECT SUM(dlvrd.dlvrdqty)
        FROM dlvrd
        WHERE dlvrd.uniqid = ord.uniqid
          AND dlvrd.dlvrddt >= ord.prevorddt
          AND dlvrd.dlvrddt < ord.curorddt) AS dlvrdqty
FROM ord
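
For instance, a minimal sqldf sketch (the ord2/dlvrd2 copies with dot-free column names below are illustrative, since SQLite would otherwise parse "." as a table qualifier):

library(sqldf)

# illustrative copies with dot-free column names
ord2   <- setNames(ord,   c("uniqid", "prevorddt", "curorddt"))
dlvrd2 <- setNames(dlvrd, c("uniqid", "dlvrddt", "dlvrdqty"))

res <- sqldf("SELECT o.*,
                     (SELECT SUM(d.dlvrdqty)
                      FROM dlvrd2 d
                      WHERE d.uniqid = o.uniqid
                        AND d.dlvrddt >= o.prevorddt
                        AND d.dlvrddt < o.curorddt) AS dlvrdqty
              FROM ord2 o")

# res$dlvrdqty holds the per-order totals; periods with no deliveries come back as NA rather than 0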

The counterpart in base R would be a conditional, correlated sum using sapply():

ord$dlvr.qty <- sapply(1:nrow(ord), function(i) {
                     tempdf <- dlvrd[dlvrd$dlvrd.dt >= ord$prev.ord.dt[i] &
                                     dlvrd$dlvrd.dt < ord$cur.ord.dt[i] &
                                     dlvrd$uniq.id == ord$uniq.id[i],]
                     sum(tempdf$dlvrd.qty)
})

Though the above resembles your original solution, your posted data reveals a noticeable timing difference between 1) mapply with dplyr and 2) sapply with base R:

library(microbenchmark)
microbenchmark(ord$dlvrd.qty <- with(ord, 
                                     mapply(add.dlvrd, uniq.id, prev.ord.dt, cur.ord.dt)))
#      min       lq     mean  median       uq      max neval
# 23.40284 24.21174 25.98971 25.6515 27.22191 32.95809   100

microbenchmark(ord$dlvr.qty <- sapply(1:nrow(ord), function(i) {
                               tempdf <- dlvrd[dlvrd$dlvrd.dt >= ord$prev.ord.dt[i] &
                                               dlvrd$dlvrd.dt < ord$cur.ord.dt[i] &
                                               dlvrd$uniq.id == ord$uniq.id[i],]
                                sum(tempdf$dlvrd.qty)
                            }))    
#      min       lq     mean   median       uq      max neval
# 6.426951 6.592485 7.157509 6.779431 7.124455 11.30587   100

And vapply() is only slightly faster here, but the savings may add up over longer runs:

microbenchmark(ord$dlvr.qty <- vapply(1:nrow(ord), function(i) {
                               tempdf <- dlvrd[dlvrd$dlvrd.dt >=  ord$prev.ord.dt[i] &
                                               dlvrd$dlvrd.dt < ord$cur.ord.dt[i] &
                                               dlvrd$uniq.id == ord$uniq.id[i],]
                                        sum(tempdf$dlvrd.qty)
                             }, numeric(1)))
#      min       lq     mean   median       uq      max neval
# 6.395672 6.525357 6.912836 6.592966 6.865086 9.737148   100
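
For a direct side-by-side comparison, both expressions could also be passed to a single microbenchmark() call (a sketch, assuming add.dlvrd, ord, and dlvrd are already defined as above), so the timings print under one unit column:

library(microbenchmark)

microbenchmark(
  mapply_dplyr = ord$dlvrd.qty <- with(ord, mapply(add.dlvrd, uniq.id, prev.ord.dt, cur.ord.dt)),
  sapply_base  = ord$dlvr.qty  <- sapply(seq_len(nrow(ord)), function(i) {
    tempdf <- dlvrd[dlvrd$dlvrd.dt >= ord$prev.ord.dt[i] &
                    dlvrd$dlvrd.dt <  ord$cur.ord.dt[i] &
                    dlvrd$uniq.id  == ord$uniq.id[i], ]
    sum(tempdf$dlvrd.qty)
  }),
  times = 20
)
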
Parfait
  • What is order.dv in the first merge()? – Jonathan H Aug 04 '16 at 19:07
  • When I tried this on my actual data, the `dlvrd.qty` came back as the same value for each `uniq.id`. – Jonathan H Aug 04 '16 at 19:31
  • Sure, your objective was: *aggregate total quantity for each uniq.id* (between prev and cur dates). And uniq.id repeats for each order in distinct order in `ord`. – Parfait Aug 04 '16 at 20:37
  • I have added a clarification to the objective above. – Jonathan H Aug 04 '16 at 20:48
  • I got this working, thank you! I ended up getting access to an extra field in the dlvrd data field with the order date, making the match even simpler. You've also helped advance my knowledge of how to do this kind of thing in the future. – Jonathan H Aug 09 '16 at 13:53
  • Excellent! Glad I could help a fellow R programmer. – Parfait Aug 09 '16 at 16:46