I have run into this kind of problem a couple of times, but I cannot figure out a more efficient way to do it. We have a dataframe `df` with values `y` sorted by date `dt` and one or more categories `x`. For example, stock market data for many instrument symbols ("AAPL34", "MSFT34", etc.).
Given another dataframe `search`, in which each row contains a date interval (`min_dt` and `max_dt`) for some category from `df$x`, I want to filter `df` to that category and interval and output summary quantities (mean, median, or whatever) for each row of `search`.
I have solved the problem as in the reprex below, but I feel it is very slow (real data usually involves 10-100 million rows for `df` and thousands of rows for `search`).
library(dplyr)
library(lubridate)
df <- tibble(dt = seq(ymd("2019-01-01"), ymd("2019-12-26"), by = "day"),
             x = rep(letters[1:3], 120),
             y = rnorm(360))

search <- tibble(category = c("a", "a", "b", "c"),
                 min_dt = ymd(c("2019-02-02", "2019-06-06", "2019-08-08", "2019-12-12")),
                 max_dt = ymd(c("2019-04-04", "2019-07-07", "2019-11-11", "2019-12-30")))
# My current solution
filter_summarise <- function(category, min_dt, max_dt) {
  df %>%
    filter(x == category, dt > min_dt, dt < max_dt) %>%
    summarise(n = n(), mean_val = mean(y))
}
bind_cols(search, purrr::pmap_dfr(search, filter_summarise))
# A tibble: 4 x 5
category min_dt max_dt n mean_val
<chr> <date> <date> <int> <dbl>
1 a 2019-02-02 2019-04-04 20 0.0618
2 a 2019-06-06 2019-07-07 10 0.170
3 b 2019-08-08 2019-11-11 32 -0.127
4 c 2019-12-12 2019-12-30 5 -0.345
I think the issue is that `filter()` creates a copy of the subset on every `purrr::pmap_dfr()` iteration, but I do not see a way around doing this. Any idea is appreciated.
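In case it helps frame answers: a sketch of one alternative I have been considering, a non-equi join (assuming the `data.table` package is acceptable), which replaces the per-row `filter()` calls with a single join over all of `search`. The data is the same toy data as in the reprex above, and the strict inequalities match my `filter()` call; I am not sure this is the idiomatic form:

```r
library(data.table)
library(lubridate)

set.seed(1)  # only so mean_val is reproducible; n does not depend on it

# Same toy data as in the reprex above
df <- data.table(dt = seq(ymd("2019-01-01"), ymd("2019-12-26"), by = "day"),
                 x  = rep(letters[1:3], 120),
                 y  = rnorm(360))
search <- data.table(category = c("a", "a", "b", "c"),
                     min_dt = ymd(c("2019-02-02", "2019-06-06", "2019-08-08", "2019-12-12")),
                     max_dt = ymd(c("2019-04-04", "2019-07-07", "2019-11-11", "2019-12-30")))

# One non-equi join instead of one filter() per search row:
# match on category, with dt strictly between min_dt and max_dt,
# then summarise once per row of `search` (by = .EACHI).
res <- df[search,
          on = .(x == category, dt > min_dt, dt < max_dt),
          .(n = .N, mean_val = mean(y)),
          by = .EACHI]
```

This yields one summary row per row of `search` (the same `n` values as my output above), without materialising a filtered copy of `df` on each iteration, which is why I suspect it would scale better; I would still welcome a dplyr-native equivalent.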