I would like to join two data sets, A
and B
. I would like to join A
and B
exactly on their id
variables, but keep only the most recent observation in B
that is between three months and three years old.
The data sets are big enough that I need to use the sqldf
package (about 500,000 rows in A
and 250,000 rows in B
). It seems that the logic should be to LEFT OUTER JOIN A AND B
with A.id = B.id
and (A.date - B.date) BETWEEN 3*30 AND 3*365
, then GROUP BY A.row
, ORDER BY B.date DESC
, then keep first observation. But my code below keeps the first observation OVERALL, not the first observation per A.row
group.
I can do this join in two steps (one sqldf
, one tidyverse
), but can sqldf
do both steps?
library(tidyverse)
library(lubridate)
#>
#> Attaching package: 'lubridate'
#> The following object is masked from 'package:base':
#>
#> date
library(sqldf)
#> Loading required package: gsubfn
#> Loading required package: proto
#> Loading required package: RSQLite
# Some toy data:
A <- tibble(id = rep(1:10, each = 2),
subid = rep(1:2, 10),
date = rep(ymd('2019-01-01'), 20))
A$row <- seq(nrow(A))
set.seed(42)
B <- tibble(id = rep(1:10, each = 10),
date = ymd('2015-01-01') + months(10*rep(1:10, 10)),
x = runif(100))
# This code properly matches A and B, but only returns the first observation OVERALL, not per A.row:
C <- sqldf('SELECT *
FROM A
LEFT OUTER JOIN B
ON A.id = B.id
AND (A.date - B.date) BETWEEN 3*30 and 3*365
GROUP BY row
ORDER BY B.date DESC
LIMIT 1') %>%
as_tibble()
C
#> # A tibble: 1 x 7
#> id subid date row id..5 date..6 x
#> <int> <int> <date> <int> <int> <dbl> <dbl>
#> 1 1 1 2019-01-01 1 1 17652 0.830
# I could do this in two steps, with the first step in sqldf and the second step in the tidyverse. This two step approach would work my data, because B has annual data, so there should not be more than three matches per row in A. However, it seems like I should be able to do the entire join in sqldf (and maybe one data I will not be able to do the second step in the tidyverse).
D <- sqldf('SELECT *
FROM A
LEFT OUTER JOIN B
ON A.id = B.id
AND (A.date - B.date) BETWEEN 3*30 and 3*365') %>%
as_tibble()
E <- D %>%
arrange(row, desc(date..6)) %>%
group_by(row) %>%
filter(row_number() == 1) %>%
ungroup()
# Below is the desired output. Can sqldf do both steps?
E
#> # A tibble: 20 x 7
#> id subid date row id..5 date..6 x
#> <int> <int> <date> <int> <int> <dbl> <dbl>
#> 1 1 1 2019-01-01 1 1 17652 0.830
#> 2 1 2 2019-01-01 2 1 17652 0.830
#> 3 2 1 2019-01-01 3 2 17652 0.255
#> 4 2 2 2019-01-01 4 2 17652 0.255
#> 5 3 1 2019-01-01 5 3 17652 0.947
#> 6 3 2 2019-01-01 6 3 17652 0.947
#> 7 4 1 2019-01-01 7 4 17652 0.685
#> 8 4 2 2019-01-01 8 4 17652 0.685
#> 9 5 1 2019-01-01 9 5 17652 0.974
#> 10 5 2 2019-01-01 10 5 17652 0.974
#> 11 6 1 2019-01-01 11 6 17652 0.785
#> 12 6 2 2019-01-01 12 6 17652 0.785
#> 13 7 1 2019-01-01 13 7 17652 0.566
#> 14 7 2 2019-01-01 14 7 17652 0.566
#> 15 8 1 2019-01-01 15 8 17652 0.479
#> 16 8 2 2019-01-01 16 8 17652 0.479
#> 17 9 1 2019-01-01 17 9 17652 0.646
#> 18 9 2 2019-01-01 18 9 17652 0.646
#> 19 10 1 2019-01-01 19 10 17652 0.933
#> 20 10 2 2019-01-01 20 10 17652 0.933
Created on 2019-07-12 by the reprex package (v0.3.0)