1

I have a data frame containing two columns: a grouping variable and a interval period over which the grouping variable holds. I have another data frame with a date column and a value column. How can I join these two tables together somewhat efficiently with dplyr+tidyverse functions?

library(dplyr)
library(lubridate)
ty <- data_frame(date = mdy(paste(1, 1 + seq(20), 2017, sep = "/")), 
                 y = c(rnorm(7), rnorm(7, mean = 2), rnorm(6, mean = -1)))
gy <- data_frame(period = interval(mdy(c("01/01/2017", "01/08/2017", "01/15/2017")), 
                                   mdy(c("01/07/2017", "01/14/2017", "01/20/2017"))), 
                          batch = c(1, 2, 3))

I want to build the table that is equivalent to:

ty %>% mutate(batch = c(rep(1, 7), rep(2, 7), rep(3, 6)))

Ideally, this should work reasonably quickly on data sets of up to 1,000,000 rows. Better still if it works on 100,000,000 :).

wdkrnls
  • 4,548
  • 7
  • 36
  • 64
  • 3
    this does not answer your question (`dplyr`), but as a comment -- you should check out [rolling joins in `data.table`](https://stackoverflow.com/questions/12030932/rolling-joins-data-table-in-r) – C8H10N4O2 Jun 06 '17 at 21:07

2 Answers2

1

How about:

ty %>% 
  mutate(batch = case_when(
  ty$date %within% gy$period[1] ~gy$batch[1],
  ty$date %within% gy$period[2] ~gy$batch[2],
  ty$date %within% gy$period[3] ~gy$batch[3]))

You would obviously need to define the case_when intervals. How many have you got? I've used cat and paste0 with good effect for that in the past.

Edited to reflect OP's comments. This should take care of the NSE and would allow the generation of the case_when intervals programatically:

ty %>%
  mutate(batch = eval(parse(text = paste0("case_when(",
                                      paste(
                                        paste0(
                                          "ty$date %within% gy$period[",
                                          seq_along(gy$period),
                                          "] ~gy$batch[",
                                          seq_along(gy$period),
                                          "]"
                                        ),
                                        collapse = ", "
                                      ), ")"))))
biomiha
  • 1,358
  • 2
  • 12
  • 25
  • This is a good idea for a relatively small number of batches. I have ~ 10,000, though. With all the NSE, I worry that this wouldn't work with paste and eval, but maybe I'm wrong :) – wdkrnls Jun 06 '17 at 21:35
0

This is the best I could come up with so far:

ty$batch <- unlist(lapply(ty$date, function(d) gy$batch[which(d %within% gy$period)]), recursive = FALSE, use.names = FALSE)

But it doesn't seem very fast.

wdkrnls
  • 4,548
  • 7
  • 36
  • 64