
Here's the setup (it's really not that complicated...):

Table JobTitles

| PersonID | JobTitle | StartDate | EndDate |
|----------|----------|-----------|---------|
| A        | A1       | 1         | 5       |
| A        | A2       | 6         | 10      |
| A        | A3       | 11        | 15      |
| B        | B1       | 2         | 4       |
| B        | B2       | 5         | 7       |
| B        | B3       | 8         | 11      |
| C        | C1       | 5         | 12      |
| C        | C2       | 13        | 14      |
| C        | C3       | 15        | 18      |

Table Transactions:

| PersonID | TransDate | Amt |
|----------|-----------|-----|
| A        | 2         | 5   |
| A        | 3         | 10  |
| A        | 12        | 5   |
| A        | 12        | 10  |
| B        | 3         | 5   |
| B        | 3         | 10  |
| B        | 10        | 5   |
| C        | 16        | 10  |
| C        | 17        | 5   |
| C        | 17        | 10  |
| C        | 17        | 5   |

Desired Output:

| PersonID | JobTitle | StartDate | EndDate | Amt |
|----------|----------|-----------|---------|-----|
| A        | A1       | 1         | 5       | 15  |
| A        | A2       | 6         | 10      | 0   |
| A        | A3       | 11        | 15      | 15  |
| B        | B1       | 2         | 4       | 15  |
| B        | B2       | 5         | 7       | 0   |
| B        | B3       | 8         | 11      | 5   |
| C        | C1       | 5         | 12      | 0   |
| C        | C2       | 13        | 14      | 0   |
| C        | C3       | 15        | 18      | 30  |

This SQL gets me the desired output (in words: left join each JobTitles row to that person's transactions whose TransDate falls between the title's StartDate and EndDate, then sum the amounts per title, defaulting to 0 for titles with no transactions):

select jt.PersonID, jt.JobTitle, jt.StartDate, jt.EndDate, coalesce(sum(amt), 0) as amt
from JobTitles jt left join
     Transactions t
     on jt.PersonId = t.PersonId and
        t.TransDate between jt.StartDate and jt.EndDate
group by jt.PersonID, jt.JobTitle, jt.StartDate, jt.EndDate;
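
For what it's worth, the same statement runs directly against the two data frames from R via the sqldf package (a sketch, assuming sqldf and its default SQLite backend are installed):

library(sqldf)  # runs SQL queries against in-memory data frames via SQLite

sqldf("select jt.PersonID, jt.JobTitle, jt.StartDate, jt.EndDate,
              coalesce(sum(t.Amt), 0) as Amt
       from JobTitles jt left join Transactions t
         on jt.PersonID = t.PersonID
        and t.TransDate between jt.StartDate and jt.EndDate
       group by jt.PersonID, jt.JobTitle, jt.StartDate, jt.EndDate")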

Tables for R:

JobTitles <- structure(list(PersonID = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 
3L, 3L, 3L), .Label = c("A", "B", "C"), class = "factor"), JobTitle = structure(1:9, .Label = c("A1", 
"A2", "A3", "B1", "B2", "B3", "C1", "C2", "C3"), class = "factor"), 
    StartDate = c(1L, 6L, 11L, 2L, 5L, 8L, 5L, 13L, 15L), EndDate = c(5L, 
    10L, 15L, 4L, 7L, 11L, 12L, 14L, 18L)), .Names = c("PersonID", 
"JobTitle", "StartDate", "EndDate"), class = "data.frame", row.names = c(NA, 
-9L))
Transactions <- structure(list(PersonID = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 
2L, 3L, 3L, 3L, 3L), .Label = c("A", "B", "C"), class = "factor"), 
    TransDate = c(2L, 3L, 12L, 12L, 3L, 3L, 10L, 16L, 17L, 17L, 
    17L), Amt = c(5L, 10L, 5L, 10L, 5L, 10L, 5L, 10L, 5L, 10L, 
    5L)), .Names = c("PersonID", "TransDate", "Amt"), class = "data.frame", row.names = c(NA, 
-11L))

How do I translate the SQL into working dplyr code? I am getting stuck on the left_join:

left_join(JobTitles, Transactions, 
          by = c("PersonID" = "PersonID", 
                 "StartDate" < "TransDate",
                 "EndDate" >= "TransDate"))
# Error: cannot join on columns 'TRUE' x '' : index out of bounds
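
(For readers on current dplyr: version 1.1.0, released long after this question, added join_by(), which can express this non-equi condition directly in the join. A minimal sketch, assuming dplyr >= 1.1.0:)

library(dplyr)

# join_by() (dplyr >= 1.1.0) supports inequality and interval conditions;
# between() here is the join helper and is inclusive, like SQL's BETWEEN
left_join(JobTitles, Transactions,
          by = join_by(PersonID, between(y$TransDate, x$StartDate, x$EndDate))) %>%
  group_by(PersonID, JobTitle, StartDate, EndDate) %>%
  summarise(Amt = sum(Amt, na.rm = TRUE), .groups = "drop")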
  • Please explain your SQL statement in words for those who are not fluent in SQL – talat Dec 09 '14 at 17:04
  • 2
    The disadvantage with your SQL syntax as well as the `dplyr` answers is that we've to first perform a (left) join to obtain the entire data, after which there's a filtering and grouping and summarising. This'll be very inefficient (in terms of memory and speed) if you've large data. If you're not wedded to dplyr, checkout `?foverlaps` from data.table or `?findOverlaps` from IRanges packages (if you're working in-memory that is). – Arun Dec 09 '14 at 17:10
  • Why not just use `sqldf` package? – zx8754 Dec 09 '14 at 17:11
  • @zx8754 I certainly could, but I am trying to understand it from the `dplyr` perspective and actually I was hoping for someone to mention something with regards to efficiency like @Arun just did. – JasonAizkalns Dec 09 '14 at 17:12
  • @Arun I am looking into `foverlaps` now. From the example in the CRAN manual, it seems you need both start and end columns in the two data.table objects. Is that right? In this question, the Transactions data frame does not have start and end; it contains transaction dates only, which may fall between the start and end dates in the other data frame, JobTitles. Would it be possible to use `foverlaps` in this case? – jazzurro Dec 10 '14 at 01:56
  • 2
    It's not currently possible to express this elegantly with dplyr – hadley Dec 10 '14 at 02:40
  • @arun but the query optimizer might not implement it that way in a database – hadley Dec 10 '14 at 02:41
  • 2
    @jazzurro, yes, by creating a dummy column. please check [this](http://stackoverflow.com/a/25655497/559784), [this](http://stackoverflow.com/a/27055265/559784) or [this](http://stackoverflow.com/a/26158226/559784) post.. – Arun Dec 10 '14 at 08:18
  • 2
    @hadley, do you know how is it done internally? Either it has to compare ranges + subset for each row which will be memory efficient but very very slow (check [this](http://stackoverflow.com/a/25655497/559784) post) or it has to join first + subset + summarise, which will be slightly faster, but would be memory inefficient. I can't see how it could be efficient in terms of both speed and memory. – Arun Dec 10 '14 at 08:19
  • @Arun Thank you very much for the link. I will check them out tomorrow. :) – jazzurro Dec 10 '14 at 15:41
  • @arun I don't know, I was wondering if you had some special knowledge that there was no possible optimisation in this case, or that the db definitely didn't apply known optimisations – hadley Dec 10 '14 at 21:28
  • @hadley, I already pointed you to a link which benchmarks (both speed and memory) `sqldf` code along with `foverlaps` (77s vs 3.8s, 1.4GB vs 1GB) just on the join part (which is similar to OP's query here). And to my knowledge, a task of joining intervals can't be performed more efficiently than by algorithms designed to deal with intervals efficiently. I'm not aware of databases able to optimise more efficiently than interval joins. You seem to talk about optimisations, but not to know what optimisations could make it (more) efficient, which is confusing. – Arun Dec 11 '14 at 06:46
  • @Arun sqldf uses sqlite which does not have a particularly good query engine. I don't understand why you think a database couldn't implement an efficient algorithm for interval joins. – hadley Dec 11 '14 at 14:35
  • @hadley, my comment was on OP's specific query (hence *your* SQL syntax). I ran the same query using `sqldf` with `PostgreSQL` - it was 3x slower. Again, if you've specific optimisations, let me know and I'd be happy to try it (and change my opinion). "I am not aware of" is very different from "I think it's not possible to implement". My tests so far haven't proven any different. Please show an example if you think it could be improved. – Arun Dec 11 '14 at 18:02
  • @hadley (and others), I've asked a question [here on SO](http://stackoverflow.com/q/27433474/559784) asking for the most efficient optimisation possible (but haven't had success yet). But I found [this post](http://stackoverflow.com/q/10907910/559784) after posting, where the OP seems to have resorted to *interval trees* (though it's not clear if it's in SQL). I've searched quite a bit on SO without success. I can't seem to find an optimised query to handle this case. It'd be nice if you could point me to a link or provide a solution on how to optimise this query efficiently in SQL. – Arun Dec 12 '14 at 12:29
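
A sketch of the foverlaps() approach Arun describes above, using the dummy-column trick from the linked posts so each transaction date becomes a zero-width interval (assumes the data.table package; names not in the question, such as TransEnd, ov, sums and out, are made up here):

library(data.table)

jt <- as.data.table(JobTitles)
tr <- as.data.table(Transactions)
tr[, TransEnd := TransDate]   # dummy end column: a single date as a zero-width interval
setkey(jt, PersonID, StartDate, EndDate)

# overlap join: match each transaction to the job-title interval containing it
ov <- foverlaps(tr, jt, by.x = c("PersonID", "TransDate", "TransEnd"))
sums <- ov[, .(Amt = sum(Amt)), by = .(PersonID, JobTitle)]

# join back onto JobTitles so titles with no transactions show Amt = 0
out <- sums[jt, on = c("PersonID", "JobTitle")]
out[is.na(Amt), Amt := 0L]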

1 Answer


Similar to the idea of @zx8754, I came up with the following. I tried to use between() because the SQL uses BETWEEN, but the outcome is basically identical to what @zx8754 has. I then computed the per-title totals (foo) and merged them with the PersonID and JobTitle columns of JobTitles to get the expected output.

foo <- left_join(JobTitles, Transactions) %>%   # joins on PersonID, the common column
       rowwise() %>%                            # between() needs scalar bounds, hence row-by-row
       mutate(check = between(TransDate, StartDate, EndDate)) %>%
       filter(check) %>%
       group_by(PersonID, JobTitle) %>%
       summarise(total = sum(Amt))

### Merge foo back onto all PersonID/JobTitle combinations from JobTitles
foo2 <- left_join(JobTitles[,c(1,2)], foo)

### NA to 0
foo2$total[is.na(foo2$total)] <- 0

#  PersonID JobTitle total
#1        A       A1    15
#2        A       A2     0
#3        A       A3    15
#4        B       B1    15
#5        B       B2     0
#6        B       B3     5
#7        C       C1     0
#8        C       C2     0
#9        C       C3    30

Or slightly shorter and in one pipe:

left_join(JobTitles, Transactions) %>%
  filter(TransDate >= StartDate & TransDate <= EndDate) %>%  # inclusive, like SQL's BETWEEN
  group_by(PersonID, JobTitle) %>%
  summarise(total = sum(Amt)) %>%
  left_join(JobTitles[,c(1,2)], .) %>%
  mutate(total = replace(total, is.na(total), 0))
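
A small variant of the last step, assuming dplyr's coalesce() is available (it arrived in a later release, dplyr 0.5.0): it mirrors the SQL's coalesce(sum(amt), 0) more directly than replace():

left_join(JobTitles, Transactions) %>%
  filter(TransDate >= StartDate & TransDate <= EndDate) %>%
  group_by(PersonID, JobTitle) %>%
  summarise(total = sum(Amt)) %>%
  left_join(JobTitles[, c(1, 2)], .) %>%
  mutate(total = coalesce(total, 0L))    # NA totals become 0, as in the SQL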
  • 3
    I hope you don't mind that I added a version in one pipe – talat Dec 09 '14 at 17:23
  • 1
    @beginneR Hey man, sorry for this late reply. I went to bed right after posting the answer. Your idea is great! This is the way to go if you do all in piping. I thought about using `replace` as well, but I didn't get bothered. Thank you for your contribution! – jazzurro Dec 10 '14 at 00:27