
I have run into this kind of problem a couple of times, but I cannot figure out a better (more efficient) way to do it.

We have a data frame `df` with values `y` sorted by date `dt` and one or more categories `x`: for example, stock market data for many instrument symbols ("AAPL34", "MSFT34", etc.).

Given another data frame `search`, where each row contains a date interval (`min_dt` and `max_dt`) for some category from `df$x`, I want to filter `df` by that category and interval and output summary quantities (mean, median, or whatever) for each row of `search`.

I have solved the problem as in the reprex below, but it feels very slow (real data usually involves 10-100 million rows for `df` and thousands of rows for `search`).

library(dplyr)
library(lubridate)

df <- tibble(dt = seq(ymd("2019-01-01"), ymd("2019-12-26"), by = "day"),
             x = rep(letters[1:3], 120),
             y = rnorm(360))

search <- tibble(category = c("a", "a", "b", "c"),
                 min_dt = ymd(c("2019-02-02", "2019-06-06", "2019-08-08", "2019-12-12")),
                 max_dt = ymd(c("2019-04-04", "2019-07-07", "2019-11-11", "2019-12-30")))

# My current solution
filter_summarise <- function(category, min_dt, max_dt) {
  df %>% 
    filter(x == category, dt > min_dt, dt < max_dt) %>% 
    summarise(n = n(), mean_val = mean(y))
}

bind_cols(search, purrr::pmap_dfr(search, filter_summarise))


# A tibble: 4 x 5
  category min_dt     max_dt         n mean_val
  <chr>    <date>     <date>     <int>    <dbl>
1 a        2019-02-02 2019-04-04    20   0.0618
2 a        2019-06-06 2019-07-07    10   0.170 
3 b        2019-08-08 2019-11-11    32  -0.127 
4 c        2019-12-12 2019-12-30     5  -0.345 

I think the issue is that the full `df` is filtered (and effectively copied) on each `purrr::pmap` iteration, but I do not see a way around doing this. Any idea is appreciated.

Gabriel M. Silva
    You might be able to join by interval such as [here](https://stackoverflow.com/q/23371747/5325862) or [here](https://stackoverflow.com/q/37289405/5325862). I'm not well-versed in `data.table`, but other posts use it for similar operations, such as [here](https://stackoverflow.com/q/27487949/5325862) and [here](https://stackoverflow.com/q/50213099/5325862) – camille Dec 27 '19 at 18:28
  • Also, using `dt > min_dt` means that dates that are the same as the minimum date *won't* be included; is that intentional? If not, it should be `>=` – camille Dec 27 '19 at 18:36
  • @camille Thanks for your reply. I am not used to `data.table`, but maybe this problem calls for it. I will take a look at those questions. As for your second comment, I am using `>` just for this example, even though, yes, it would make more sense to use `>=`. – Gabriel M. Silva Dec 27 '19 at 18:46
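
As an illustration of the join-by-interval idea suggested in the comments above, here is a minimal dplyr-only sketch (not from the thread, and only one way to express it): an equality join on the category, a filter on the interval, then a per-interval summary. It reproduces the question's output, though it still materialises every matching row before summarising, and intervals with no matching rows drop out of the result.

# Sketch: join on the category, then restrict each row to its search interval.
# Uses the same strict > / < comparisons as the question.
library(dplyr)

df %>%
  inner_join(search, by = c("x" = "category")) %>%
  filter(dt > min_dt, dt < max_dt) %>%
  group_by(x, min_dt, max_dt) %>%
  summarise(n = n(), mean_val = mean(y))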

1 Answer


I reproduced the same results using `data.table`, but it actually performs worse than the OP's solution. I am leaving it here in case it helps other people answer:

library(data.table)
setDT(df)
setDT(search)

df[search,
   on = .(dt > min_dt, dt < max_dt, x = category),
   .(min_dt, max_dt, dt, x, y, category)
   ][, list(.N, mean_val = mean(y)),
     by = list(min_dt, max_dt, category)]

Benchmark:

dt_summ = function(df, search){
  setDT(df)
  setDT(search)

  setkeyv(df, c("dt", "y"))

  df[search,
     on = .(dt > min_dt, dt < max_dt, x = category),
     .(min_dt, max_dt, dt, x, y, category)
     ][, list(.N, mean_val = mean(y)),
       by = list(min_dt, max_dt, category)]
}


dplyr_summ = function(df, search){
  bind_cols(search, purrr::pmap_dfr(search, filter_summarise))
}

library(microbenchmark)
microbenchmark(
  dplyr = dplyr_summ(df, search),
  dt = dt_summ(df, search)
)

#Unit: milliseconds
#  expr    min     lq     mean  median     uq     max neval
# dplyr 4.0562 4.4588 5.580925 4.70385 5.0531 65.5202   100
#    dt 6.7754 7.5449 8.246862 7.97395 8.6485 15.8260   100
Fino
  • I tested your code with `data.table 1.12.9 dev` and it ran 3 times faster than `dplyr 0.8.99.9000 dev` on a 6-core i7 8th gen PC. – Tung Dec 28 '19 at 01:46
  • Sorry for the late response. I just tested your `data.table` alternative (adapted to my real problem) and it runs approximately 60 times faster than my `dplyr` code (from ~40 min to 40 seconds)! I think the runtime difference grows with the number of rows in `df` and `search`. Awesome, thank you. Time to learn `data.table` (or "cheat" with `dtplyr`). – Gabriel M. Silva Jan 02 '20 at 19:55
  • The only "problem" is that the summary for `.N` gives 1 instead of 0 when there are no observations. It is easy to correct but I don't understand why it happens. – Gabriel M. Silva Jan 02 '20 at 20:07
  • @Tung so weird. I tested it on my home and work computers and the results were the same. My versions of `data.table` and `dplyr` are older though. – Fino Jan 02 '20 at 21:55
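
A note on the `.N` behaviour raised in the comments: with data.table's default `nomatch = NA`, a `search` row that matches nothing still produces a single all-`NA` row in the join, so `.N` counts it as 1. One way to report 0 instead, shown here as a sketch that only changes the summary step of the answer's code, is to count non-missing `y` values:

# Sketch: count only rows that actually matched, so empty intervals give n = 0.
# (mean_val becomes NaN for those intervals.)
df[search,
   on = .(dt > min_dt, dt < max_dt, x = category),
   .(min_dt, max_dt, dt, x, y, category)
   ][, list(n = sum(!is.na(y)), mean_val = mean(y, na.rm = TRUE)),
     by = list(min_dt, max_dt, category)]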