2

I would like to join two data sets, A and B. I would like to join A and B exactly on their id variables, but keep only the most recent observation in B that is between three months and three years old.

The data sets are big enough that I need to use the sqldf package (about 500,000 rows in A and 250,000 rows in B). It seems that the logic should be to LEFT OUTER JOIN A AND B with A.id = B.id and (A.date - B.date) BETWEEN 3*30 AND 3*365, then GROUP BY A.row, ORDER BY B.date DESC, then keep first observation. But my code below keeps the first observation OVERALL, not the first observation per A.row group.

I can do this join in two steps (one sqldf, one tidyverse), but can sqldf do both steps?

library(tidyverse)
library(lubridate)
#> 
#> Attaching package: 'lubridate'
#> The following object is masked from 'package:base':
#> 
#>     date
library(sqldf)
#> Loading required package: gsubfn
#> Loading required package: proto
#> Loading required package: RSQLite

# Some toy data:
A <- tibble(id = rep(1:10, each = 2),
               subid = rep(1:2, 10),
               date = rep(ymd('2019-01-01'), 20))
A$row <- seq(nrow(A))


set.seed(42)
B <- tibble(id = rep(1:10, each = 10),
            date = ymd('2015-01-01') + months(10*rep(1:10, 10)),
            x = runif(100))


# This code properly matches A and B, but only returns the first observation OVERALL, not per A.row:
C <- sqldf('SELECT *
           FROM A
           LEFT OUTER JOIN B
           ON A.id = B.id
           AND (A.date - B.date) BETWEEN 3*30 and 3*365
           GROUP BY row
           ORDER BY B.date DESC
           LIMIT 1') %>% 
    as_tibble()
C
#> # A tibble: 1 x 7
#>      id subid date         row id..5 date..6     x
#>   <int> <int> <date>     <int> <int>   <dbl> <dbl>
#> 1     1     1 2019-01-01     1     1   17652 0.830


# I could do this in two steps, with the first step in sqldf and the second step in the tidyverse. This two step approach would work my data, because B has annual data, so there should not be more than three matches per row in A. However, it seems like I should be able to do the entire join in sqldf (and maybe one data I will not be able to do the second step in the tidyverse).
D <- sqldf('SELECT *
           FROM A
           LEFT OUTER JOIN B
           ON A.id = B.id
           AND (A.date - B.date) BETWEEN 3*30 and 3*365') %>% 
    as_tibble()

E <- D %>%
    arrange(row, desc(date..6)) %>%
    group_by(row) %>%
    filter(row_number() == 1) %>%
    ungroup()


# Below is the desired output. Can sqldf do both steps?
E
#> # A tibble: 20 x 7
#>       id subid date         row id..5 date..6     x
#>    <int> <int> <date>     <int> <int>   <dbl> <dbl>
#>  1     1     1 2019-01-01     1     1   17652 0.830
#>  2     1     2 2019-01-01     2     1   17652 0.830
#>  3     2     1 2019-01-01     3     2   17652 0.255
#>  4     2     2 2019-01-01     4     2   17652 0.255
#>  5     3     1 2019-01-01     5     3   17652 0.947
#>  6     3     2 2019-01-01     6     3   17652 0.947
#>  7     4     1 2019-01-01     7     4   17652 0.685
#>  8     4     2 2019-01-01     8     4   17652 0.685
#>  9     5     1 2019-01-01     9     5   17652 0.974
#> 10     5     2 2019-01-01    10     5   17652 0.974
#> 11     6     1 2019-01-01    11     6   17652 0.785
#> 12     6     2 2019-01-01    12     6   17652 0.785
#> 13     7     1 2019-01-01    13     7   17652 0.566
#> 14     7     2 2019-01-01    14     7   17652 0.566
#> 15     8     1 2019-01-01    15     8   17652 0.479
#> 16     8     2 2019-01-01    16     8   17652 0.479
#> 17     9     1 2019-01-01    17     9   17652 0.646
#> 18     9     2 2019-01-01    18     9   17652 0.646
#> 19    10     1 2019-01-01    19    10   17652 0.933
#> 20    10     2 2019-01-01    20    10   17652 0.933

Created on 2019-07-12 by the reprex package (v0.3.0)

Richard Herron
  • 9,760
  • 12
  • 69
  • 116
  • In SQL `LIMIT` isn't an aggregate operation that can be `GROUP BY`ed. The common SQL solution would be to use a window function, [something like this](https://stackoverflow.com/a/1313293/903061). – Gregor Thomas Jul 12 '19 at 18:41

2 Answers2

3

Consider window functions such as RANK() of which likely dplyr::row_number() is adopted (among other SQL semantics like select, group_by, case_when). SQLite (default dialect of sqldf) recently added support of window functions in version 3.25.0 (September 2018 release).

If not available in sqldf (depending on version), use a Postgres backend via RPostgreSQL. See author docs. Possibly too or soon, RMySQL will be another supported backend as MySQL 8 recently added support of window functions.

library(RPostgreSQL)
library(sqldf)

D <- sqldf('WITH cte AS
               (SELECT *,
                       RANK() OVER (PARTITION BY "B".row ORDER BY "B".date DESC) AS rn
                FROM "A"
                LEFT JOIN "B"
                    ON "A".id = "B".id
                   AND ("A".date - "B".date) BETWEEN 3*30 and 3*365
               )

           SELECT * FROM cte
           WHERE rn = 1')
Parfait
  • 104,375
  • 17
  • 94
  • 125
  • Your code works with my toy data. But with my real data I get the following error: `Error in postgresqlExecStatement(conn, statement, ...) : RS-DBI driver: (could not Retrieve the result : ERROR: column temp_lt.ncusip does not exist LINE 6: LEFT JOIN "ccm" ON "temp_lt".NCUSIP = "ccm".cusip8` Here `temp_lt` is `A` and `ccm` is `B` (and CUSIPs are `id`s) – Richard Herron Jul 12 '19 at 20:57
  • 1
    Is column name all caps like that? Because R is case sensitive and Postgres renders all identifiers lower case, you need to wrap field names with all caps in quotes like I do with *A* and *B* tables here. So try: `"temp_lt"."NCUSIP" = "ccm"."cusip8"` – Parfait Jul 12 '19 at 21:00
  • 1
    That does it. I did not realize that is why `A` and `B` were quoted. I thought tables were quoted, but uppercase is quoted to preserve case. Thanks! – Richard Herron Jul 12 '19 at 21:06
  • 1
    The development version of RSQLite includes a more recent version of SQLite so with that and sqldf one could use windowing functions with SQLite: https://github.com/r-dbi/RSQLite – G. Grothendieck Jul 12 '19 at 21:14
1

In SQLite if you use max or min in a group by then that entire row is used so:

sqldf('SELECT 
    A.rowid as A_row, 
    A.id, 
    A.subid, 
    A.date as A_date__Date, 
    max(B.rowid) as B_row, 
    B.date as B_date__Date, 
    B.x
  FROM A
  LEFT OUTER JOIN B ON A.id = B.id AND (A.date - B.date) BETWEEN 3*30 AND 3*365
  GROUP BY A.rowid
  ', method = "name__class")

giving:

   A_row id subid     A_date B_row     B_date         x
1      1  1     1 2019-01-01     4 2018-05-01 0.8304476
2      2  1     2 2019-01-01     4 2018-05-01 0.8304476
3      3  2     1 2019-01-01    14 2018-05-01 0.2554288
4      4  2     2 2019-01-01    14 2018-05-01 0.2554288
5      5  3     1 2019-01-01    24 2018-05-01 0.9466682
6      6  3     2 2019-01-01    24 2018-05-01 0.9466682
7      7  4     1 2019-01-01    34 2018-05-01 0.6851697
8      8  4     2 2019-01-01    34 2018-05-01 0.6851697
9      9  5     1 2019-01-01    44 2018-05-01 0.9735399
10    10  5     2 2019-01-01    44 2018-05-01 0.9735399
11    11  6     1 2019-01-01    54 2018-05-01 0.7846928
12    12  6     2 2019-01-01    54 2018-05-01 0.7846928
13    13  7     1 2019-01-01    64 2018-05-01 0.5664884
14    14  7     2 2019-01-01    64 2018-05-01 0.5664884
15    15  8     1 2019-01-01    74 2018-05-01 0.4793986
16    16  8     2 2019-01-01    74 2018-05-01 0.4793986
17    17  9     1 2019-01-01    84 2018-05-01 0.6456319
18    18  9     2 2019-01-01    84 2018-05-01 0.6456319
19    19 10     1 2019-01-01    94 2018-05-01 0.9330341
20    20 10     2 2019-01-01    94 2018-05-01 0.9330341
G. Grothendieck
  • 254,981
  • 17
  • 203
  • 341