
I have a relatively large (~2.5 million record) data frame structured more or less like this:

df <- tibble(year = rep(2000:2009, times = 10), value = sample(0:1000, 100)) %>% 
  arrange(year)

I want to assign each value to a bucket based on quantiles within each year (e.g. observation 1 is in the bottom quartile of values in the year 2000). For my trivial example, this works fine:

df <- df %>% 
  group_by(year) %>% 
  mutate(bucket = as.numeric(cut(value, 
                             breaks = quantile(value, probs = c(0, .25, .5, .75, 1)), 
                             include.lowest = T)))

df

# A tibble: 100 x 3
# Groups:   year [10]
    year value bucket
   <int> <int>  <dbl>
 1  2000   281      1
 2  2000   336      2
 3  2000   873      4
 4  2000    41      1
 5  2000   335      1
 6  2000   939      4
 7  2000   746      3
 8  2000   762      4
 9  2000   345      2
10  2000   628      3

But on my full dataset, that (unsurprisingly) takes forever. To speed things up, I calculated the quantiles for each year in a separate dataframe:

library(magrittr)  # for the %$% exposition pipe

break_calc <- function(yr) {
  df %>% 
    filter(year == yr) %$%
    quantile(value, probs = c(0, .25, .5, .75, 1))
}

df_quants <- tibble(year = 2000:2009) %>% 
  mutate(breaks = map(year, ~break_calc(.x)))

But even with that, I'm struggling to come up with a solution that doesn't take forever. This was very slow on my full data:

df %>% 
  mutate(bucket = map2_dbl(value, year,
                       ~cut(.x, breaks = unlist(df_quants$breaks[df_quants$year == .y]), include.lowest = T)))

This is maybe a bit faster, but not great:

df %>% 
  left_join(df_quants, by = "year") %>% 
  mutate(bucket = map2_dbl(value, breaks, ~cut(.x, breaks = unlist(.y), include.lowest = T)))

Any ideas on how to optimize this? Strong preference for keeping it in the dplyr/tidyverse universe, though I'm open to apply/data.table solutions if they're meaningfully faster.

benc
  • How important is it that the quantiles be exact? You might be able to just sort by year and value, and divide each year into 4 based on number of rows as an approximation. – Marius Jan 23 '19 at 22:59
  • That's a good thought @Marius. As it happens, though, my actual calculation is more complicated than this -- it's actually a weighted quantile (because this is survey microdata), so hard to eyeball. – benc Jan 24 '19 at 22:03

1 Answer


How about using data.table and ntile from dplyr?

library(dplyr)
library(data.table)
df <- as.data.table(df)
df[, bucket:=ntile(value,4), by=year]

You can also use ntile with your code, but I find data.table faster and cleaner.
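For reference, a tidyverse-only sketch of the same idea (note that `ntile` ranks the values and splits each year into equal-count groups, which is not identical to `quantile()` breaks):

```r
library(dplyr)

# Rebuild the example data from the question
set.seed(1)
df <- tibble(year = rep(2000:2009, times = 10), value = sample(0:1000, 100)) %>% 
  arrange(year)

# One vectorized ntile() call per year -- no per-row iteration
df %>% 
  group_by(year) %>% 
  mutate(bucket = ntile(value, 4)) %>% 
  ungroup()
```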

Using quantile:

library(dplyr) 
library(data.table)
df <- as.data.table(df)
df[, bucket:= as.integer(cut(value, 
                            breaks = quantile(value, probs = c(0, .25, .5, .75, 1)), 
                            include.lowest = T)), by=year]
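If you want to keep your exact `quantile()` breaks but stay in dplyr, note that the slow `map2_dbl` versions call `cut()` once per row; joining the precomputed breaks and calling `cut()` once per group should be far cheaper. A sketch (untested at the scale of your data; `df_quants` is rebuilt here with `summarise` rather than your `break_calc` helper):

```r
library(dplyr)

# Rebuild the example data from the question
set.seed(1)
df <- tibble(year = rep(2000:2009, times = 10), value = sample(0:1000, 100)) %>% 
  arrange(year)

# Precomputed per-year breaks, equivalent to the question's df_quants
df_quants <- df %>% 
  group_by(year) %>% 
  summarise(breaks = list(quantile(value, probs = c(0, .25, .5, .75, 1))))

# One cut() call per year instead of one per row
df %>% 
  left_join(df_quants, by = "year") %>% 
  group_by(year) %>% 
  mutate(bucket = as.numeric(cut(value, breaks = breaks[[1]], 
                                 include.lowest = TRUE))) %>% 
  ungroup() %>% 
  select(-breaks)
```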
LocoGris
    Just be aware that `ntile` and `quantile` will not give you exactly the same results. – joran Jan 23 '19 at 22:56
  • @joran is totally right, see https://stackoverflow.com/questions/41653562/difference-between-ntile-and-cut-and-then-quantile-function-in-r – LocoGris Jan 23 '19 at 22:59
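To make the discrepancy concrete, a small example with tied values (illustrative data, not from the question): `ntile` breaks ties by position, while `cut` + `quantile` keeps equal values in the same bucket, which can leave a bucket empty.

```r
library(dplyr)

# Ties straddling a quartile boundary
x <- c(1, 2, 2, 2, 3, 4, 5, 6)

# ntile() splits into equal-count groups, breaking ties by position
ntile(x, 4)
#> 1 1 2 2 3 3 4 4

# cut() + quantile() keeps equal values together; here bucket 2 ends up empty
as.numeric(cut(x, breaks = quantile(x, probs = c(0, .25, .5, .75, 1)),
               include.lowest = TRUE))
#> 1 1 1 1 3 3 4 4
```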