
Here's the setup (it's really not that complicated...):

Table JobTitles

| PersonID | JobTitle | StartDate | EndDate |
|----------|----------|-----------|---------|
| A        | A1       | 1         | 5       |
| A        | A2       | 6         | 10      |
| A        | A3       | 11        | 15      |
| B        | B1       | 2         | 4       |
| B        | B2       | 5         | 7       |
| B        | B3       | 8         | 11      |
| C        | C1       | 5         | 12      |
| C        | C2       | 13        | 14      |
| C        | C3       | 15        | 18      |

Table Transactions:

| PersonID | TransDate | Amt |
|----------|-----------|-----|
| A        | 2         | 5   |
| A        | 3         | 10  |
| A        | 12        | 5   |
| A        | 12        | 10  |
| B        | 3         | 5   |
| B        | 3         | 10  |
| B        | 10        | 5   |
| C        | 16        | 10  |
| C        | 17        | 5   |
| C        | 17        | 10  |
| C        | 17        | 5   |

Desired Output:

| PersonID | JobTitle | StartDate | EndDate | Amt |
|----------|----------|-----------|---------|-----|
| A        | A1       | 1         | 5       | 15  |
| A        | A2       | 6         | 10      | 0   |
| A        | A3       | 11        | 15      | 15  |
| B        | B1       | 2         | 4       | 15  |
| B        | B2       | 5         | 7       | 0   |
| B        | B3       | 8         | 11      | 5   |
| C        | C1       | 5         | 12      | 0   |
| C        | C2       | 13        | 14      | 0   |
| C        | C3       | 15        | 18      | 30  |

This SQL gets me the desired output (in words: left join each JobTitles row to that person's transactions whose TransDate falls between the title's StartDate and EndDate, then sum the amounts per title, defaulting to 0 for titles with no transactions):

select jt.PersonID, jt.JobTitle, jt.StartDate, jt.EndDate, coalesce(sum(amt), 0) as amt
from JobTitles jt left join
     Transactions t
     on jt.PersonId = t.PersonId and
        t.TransDate between jt.StartDate and jt.EndDate
group by jt.PersonID, jt.JobTitle, jt.StartDate, jt.EndDate;
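
For what it's worth, the same statement runs directly against the two data frames from R via the sqldf package (a sketch, assuming sqldf and its default SQLite backend are installed):

library(sqldf)  # runs SQL queries against in-memory data frames via SQLite

sqldf("select jt.PersonID, jt.JobTitle, jt.StartDate, jt.EndDate,
              coalesce(sum(t.Amt), 0) as Amt
       from JobTitles jt left join Transactions t
         on jt.PersonID = t.PersonID
        and t.TransDate between jt.StartDate and jt.EndDate
       group by jt.PersonID, jt.JobTitle, jt.StartDate, jt.EndDate")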

Tables for R:

JobTitles <- structure(list(PersonID = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 
3L, 3L, 3L), .Label = c("A", "B", "C"), class = "factor"), JobTitle = structure(1:9, .Label = c("A1", 
"A2", "A3", "B1", "B2", "B3", "C1", "C2", "C3"), class = "factor"), 
    StartDate = c(1L, 6L, 11L, 2L, 5L, 8L, 5L, 13L, 15L), EndDate = c(5L, 
    10L, 15L, 4L, 7L, 11L, 12L, 14L, 18L)), .Names = c("PersonID", 
"JobTitle", "StartDate", "EndDate"), class = "data.frame", row.names = c(NA, 
-9L))
Transactions <- structure(list(PersonID = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 
2L, 3L, 3L, 3L, 3L), .Label = c("A", "B", "C"), class = "factor"), 
    TransDate = c(2L, 3L, 12L, 12L, 3L, 3L, 10L, 16L, 17L, 17L, 
    17L), Amt = c(5L, 10L, 5L, 10L, 5L, 10L, 5L, 10L, 5L, 10L, 
    5L)), .Names = c("PersonID", "TransDate", "Amt"), class = "data.frame", row.names = c(NA, 
-11L))

How do I translate the SQL into working dplyr code? I am getting stuck on the left_join:

left_join(JobTitles, Transactions, 
          by = c("PersonID" = "PersonID", 
                 "StartDate" < "TransDate",
                 "EndDate" >= "TransDate"))
# Error: cannot join on columns 'TRUE' x '' : index out of bounds
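
(For readers on current dplyr: version 1.1.0, released long after this question, added join_by(), which can express this non-equi condition directly in the join. A minimal sketch, assuming dplyr >= 1.1.0:)

library(dplyr)

# join_by() (dplyr >= 1.1.0) supports inequality and interval conditions;
# between() here is the join helper and is inclusive, like SQL's BETWEEN
left_join(JobTitles, Transactions,
          by = join_by(PersonID, between(y$TransDate, x$StartDate, x$EndDate))) %>%
  group_by(PersonID, JobTitle, StartDate, EndDate) %>%
  summarise(Amt = sum(Amt, na.rm = TRUE), .groups = "drop")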
  • Please explain your SQL statement in words for those who are not fluent in SQL – talat Dec 09 '14 at 17:04
  • 2
    The disadvantage with your SQL syntax as well as the `dplyr` answers is that we've to first perform a (left) join to obtain the entire data, after which there's a filtering and grouping and summarising. This'll be very inefficient (in terms of memory and speed) if you've large data. If you're not wedded to dplyr, checkout `?foverlaps` from data.table or `?findOverlaps` from IRanges packages (if you're working in-memory that is). – Arun Dec 09 '14 at 17:10
  • Why not just use `sqldf` package? – zx8754 Dec 09 '14 at 17:11
  • @zx8754 I certainly could, but I am trying to understand it from the `dplyr` perspective and actually I was hoping for someone to mention something with regards to efficiency like @Arun just did. – JasonAizkalns Dec 09 '14 at 17:12
  • @Arun I am looking into `foverlaps` now. From the example in the CRAN manual, it seems you need both start and end columns in the two data.table objects. Is that right? In this question, the Transactions data frame does not have start and end; it contains transaction dates only, which may fall between the start and end dates in the other data frame, JobTitles. Would it be possible to use `foverlaps` in this case? – jazzurro Dec 10 '14 at 01:56
  • 2
    It's not currently possible to express this elegantly with dplyr – hadley Dec 10 '14 at 02:40
  • @arun but the query optimizer might not implement it that way in a database – hadley Dec 10 '14 at 02:41
  • 2
    @jazzurro, yes, by creating a dummy column. please check [this](http://stackoverflow.com/a/25655497/559784), [this](http://stackoverflow.com/a/27055265/559784) or [this](http://stackoverflow.com/a/26158226/559784) post.. – Arun Dec 10 '14 at 08:18
  • 2
    @hadley, do you know how is it done internally? Either it has to compare ranges + subset for each row which will be memory efficient but very very slow (check [this](http://stackoverflow.com/a/25655497/559784) post) or it has to join first + subset + summarise, which will be slightly faster, but would be memory inefficient. I can't see how it could be efficient in terms of both speed and memory. – Arun Dec 10 '14 at 08:19
  • @Arun Thank you very much for the link. I will check them out tomorrow. :) – jazzurro Dec 10 '14 at 15:41
  • @arun I don't know, I was wondering if you had some special knowledge that there was no possible optimisation in this case, or that the db definitely didn't apply known optimisations – hadley Dec 10 '14 at 21:28
  • @hadley, I already pointed you to a link which benchmarks (both speed and memory) `sqldf` code along with `foverlaps` (77s vs 3.8s, 1.4GB vs 1GB) just on the join part (which is similar to OP's query here). And to my knowledge, a task of joining intervals can't be performed more efficiently than by algorithms designed to deal with intervals efficiently. I'm not aware of databases able to optimise more efficiently than interval joins. You seem to talk about optimisations, but not to know what optimisations could make it (more) efficient, which is confusing. – Arun Dec 11 '14 at 06:46
  • @Arun sqldf uses sqlite which does not have a particularly good query engine. I don't understand why you think a database couldn't implement an efficient algorithm for interval joins. – hadley Dec 11 '14 at 14:35
  • @hadley, my comment was on OP's specific query (hence *your* SQL syntax). I ran the same query using `sqldf` with `PostgreSQL` - it was 3x slower. Again, if you've specific optimisations, let me know and I'd be happy to try it (and change my opinion). "I am not aware of" is very different from "I think it's not possible to implement". My tests so far haven't proven any different. Please show an example if you think it could be improved. – Arun Dec 11 '14 at 18:02
  • @hadley (and others), I've asked a question [here on SO](http://stackoverflow.com/q/27433474/559784) asking for the most efficient optimisation possible (but haven't had success yet). But I found [this post](http://stackoverflow.com/q/10907910/559784) after posting, where the OP seems to have resorted to *interval trees* (though it's not clear if it's in SQL). I've searched quite a bit on SO without success. I can't seem to find an optimised query to handle this case. It'd be nice if you could point me to a link or provide a solution on how to optimise this query efficiently in SQL. – Arun Dec 12 '14 at 12:29
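
A sketch of the foverlaps() approach Arun describes above, using the dummy-column trick from the linked posts so each transaction date becomes a zero-width interval (assumes the data.table package; names not in the question, such as TransEnd, ov, sums and out, are made up here):

library(data.table)

jt <- as.data.table(JobTitles)
tr <- as.data.table(Transactions)
tr[, TransEnd := TransDate]   # dummy end column: a single date as a zero-width interval
setkey(jt, PersonID, StartDate, EndDate)

# overlap join: match each transaction to the job-title interval containing it
ov <- foverlaps(tr, jt, by.x = c("PersonID", "TransDate", "TransEnd"))
sums <- ov[, .(Amt = sum(Amt)), by = .(PersonID, JobTitle)]

# join back onto JobTitles so titles with no transactions show Amt = 0
out <- sums[jt, on = c("PersonID", "JobTitle")]
out[is.na(Amt), Amt := 0L]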

1 Answer


Similar to the idea of @zx8754, I came up with the following. I tried to use between() because the SQL uses BETWEEN, but the outcome is basically identical to what @zx8754 has. I then computed the per-title totals (foo) and merged them with the PersonID and JobTitle columns of JobTitles to get the expected output.

foo <- left_join(JobTitles, Transactions) %>%   # joins on PersonID, the common column
       rowwise() %>%                            # between() needs scalar bounds, hence row-by-row
       mutate(check = between(TransDate, StartDate, EndDate)) %>%
       filter(check) %>%
       group_by(PersonID, JobTitle) %>%
       summarise(total = sum(Amt))

### Merge foo back onto all PersonID/JobTitle combinations from JobTitles
foo2 <- left_join(JobTitles[,c(1,2)], foo)

### NA to 0
foo2$total[is.na(foo2$total)] <- 0

#  PersonID JobTitle total
#1        A       A1    15
#2        A       A2     0
#3        A       A3    15
#4        B       B1    15
#5        B       B2     0
#6        B       B3     5
#7        C       C1     0
#8        C       C2     0
#9        C       C3    30

Or slightly shorter and in one pipe:

left_join(JobTitles, Transactions) %>%
  filter(TransDate >= StartDate & TransDate <= EndDate) %>%  # inclusive, like SQL's BETWEEN
  group_by(PersonID, JobTitle) %>%
  summarise(total = sum(Amt)) %>%
  left_join(JobTitles[,c(1,2)], .) %>%
  mutate(total = replace(total, is.na(total), 0))
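
A small variant of the last step, assuming dplyr's coalesce() is available (it arrived in a later release, dplyr 0.5.0): it mirrors the SQL's coalesce(sum(amt), 0) more directly than replace():

left_join(JobTitles, Transactions) %>%
  filter(TransDate >= StartDate & TransDate <= EndDate) %>%
  group_by(PersonID, JobTitle) %>%
  summarise(total = sum(Amt)) %>%
  left_join(JobTitles[, c(1, 2)], .) %>%
  mutate(total = coalesce(total, 0L))    # NA totals become 0, as in the SQL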
  • 3
    I hope you don't mind that I added a version in one pipe – talat Dec 09 '14 at 17:23
  • 1
    @beginneR Hey man, sorry for this late reply. I went to bed right after posting the answer. Your idea is great! This is the way to go if you do all in piping. I thought about using `replace` as well, but I didn't get bothered. Thank you for your contribution! – jazzurro Dec 10 '14 at 00:27