
I have a relatively large (~2.5 million record) data frame structured more or less like this:

df <- tibble(year = rep(2000:2009, times = 10), value = sample(0:1000, 100)) %>% 
  arrange(year)

I want to assign each value to a bucket based on quantiles within each year (e.g. observation 1 is in the bottom quartile of values in the year 2000). For my trivial example, this works fine:

df <- df %>% 
  group_by(year) %>% 
  mutate(bucket = as.numeric(cut(value, 
                             breaks = quantile(value, probs = c(0, .25, .5, .75, 1)), 
                             include.lowest = T)))

df

# A tibble: 100 x 3
# Groups:   year [10]
    year value bucket
   <int> <int>  <dbl>
 1  2000   281      1
 2  2000   336      2
 3  2000   873      4
 4  2000    41      1
 5  2000   335      1
 6  2000   939      4
 7  2000   746      3
 8  2000   762      4
 9  2000   345      2
10  2000   628      3

But on my full dataset, that (unsurprisingly) takes forever. To speed things up, I calculated the quantiles for each year in a separate dataframe:

library(magrittr)  # for the %$% exposition pipe

break_calc <- function(yr) {
  df %>% 
    filter(year == yr) %$%
    quantile(value, probs = c(0, .25, .5, .75, 1))
}

df_quants <- tibble(year = 2000:2009) %>% 
  mutate(breaks = map(year, ~break_calc(.x)))

But even with that, I'm struggling to come up with a solution that doesn't take forever. This was very slow on my full data:

df %>% 
  mutate(bucket = map2_dbl(value, year,
                       ~cut(.x, breaks = unlist(df_quants$breaks[df_quants$year == .y]), include.lowest = T)))

This is maybe a bit faster, but not great:

df %>% 
  left_join(df_quants, by = "year") %>% 
  mutate(bucket = map2_dbl(value, breaks, ~cut(.x, breaks = unlist(.y), include.lowest = T)))

Any ideas on how to optimize this? Strong preference for keeping it in the dplyr/tidyverse universe, though I'm open to apply/data.table solutions if they're meaningfully faster.

benc
  • How important is it that the quantiles be exact? You might be able to just sort by year and value, and divide each year into 4 based on number of rows as an approximation. – Marius Jan 23 '19 at 22:59
  • That's a good thought @Marius. As it happens, though, my actual calculation is more complicated than this -- it's actually a weighted quantile (because this is survey microdata), so hard to eyeball. – benc Jan 24 '19 at 22:03

1 Answer


How about using data.table and ntile from dplyr?

library(dplyr)
library(data.table)
df <- as.data.table(df)
df[, bucket:=ntile(value,4), by=year]

You can also use ntile with your code, but I find data.table faster and cleaner.
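For reference, a tidyverse-only sketch of the same idea (note that `ntile` ranks the values and splits each year into equal-count groups, which is not identical to `quantile()` breaks):

```r
library(dplyr)

# Rebuild the example data from the question
set.seed(1)
df <- tibble(year = rep(2000:2009, times = 10), value = sample(0:1000, 100)) %>% 
  arrange(year)

# One vectorized ntile() call per year -- no per-row iteration
df %>% 
  group_by(year) %>% 
  mutate(bucket = ntile(value, 4)) %>% 
  ungroup()
```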

Using quantile:

library(dplyr) 
library(data.table)
df <- as.data.table(df)
df[, bucket:= as.integer(cut(value, 
                            breaks = quantile(value, probs = c(0, .25, .5, .75, 1)), 
                            include.lowest = T)), by=year]
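If you want to keep your exact `quantile()` breaks but stay in dplyr, note that the slow `map2_dbl` versions call `cut()` once per row; joining the precomputed breaks and calling `cut()` once per group should be far cheaper. A sketch (untested at the scale of your data; `df_quants` is rebuilt here with `summarise` rather than your `break_calc` helper):

```r
library(dplyr)

# Rebuild the example data from the question
set.seed(1)
df <- tibble(year = rep(2000:2009, times = 10), value = sample(0:1000, 100)) %>% 
  arrange(year)

# Precomputed per-year breaks, equivalent to the question's df_quants
df_quants <- df %>% 
  group_by(year) %>% 
  summarise(breaks = list(quantile(value, probs = c(0, .25, .5, .75, 1))))

# One cut() call per year instead of one per row
df %>% 
  left_join(df_quants, by = "year") %>% 
  group_by(year) %>% 
  mutate(bucket = as.numeric(cut(value, breaks = breaks[[1]], 
                                 include.lowest = TRUE))) %>% 
  ungroup() %>% 
  select(-breaks)
```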
LocoGris
    Just be aware that `ntile` and `quantile` will not give you exactly the same results. – joran Jan 23 '19 at 22:56
  • @joran is totally right, see https://stackoverflow.com/questions/41653562/difference-between-ntile-and-cut-and-then-quantile-function-in-r – LocoGris Jan 23 '19 at 22:59
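To make the discrepancy concrete, a small example with tied values (illustrative data, not from the question): `ntile` breaks ties by position, while `cut` + `quantile` keeps equal values in the same bucket, which can leave a bucket empty.

```r
library(dplyr)

# Ties straddling a quartile boundary
x <- c(1, 2, 2, 2, 3, 4, 5, 6)

# ntile() splits into equal-count groups, breaking ties by position
ntile(x, 4)
#> 1 1 2 2 3 3 4 4

# cut() + quantile() keeps equal values together; here bucket 2 ends up empty
as.numeric(cut(x, breaks = quantile(x, probs = c(0, .25, .5, .75, 1)),
               include.lowest = TRUE))
#> 1 1 1 1 3 3 4 4
```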