0

In this example I have a tibble with two variables:

  • a group variable gr
  • the variable of interest val
set.seed(123)
df <- tibble(gr = rep(1:3, each = 10), 
             val = gr + rnorm(30))

Goal

I want to produce a discretized version of val using the function findInterval but the breakpoints should be gr-specific, since in my actual data as well as in this example, the distribution of valdepends on gr. The breakpoints are determined within each group using the quartiles of val.

What I did

I first construct a nested tibble containing the vectors of breakpoints for each value of gr:

df_breakpoints <- bind_cols(gr = 1:3, 
                            purrr::map_dfr(1:3, function(gr) {
                              c(-Inf, quantile(df$val[df$gr == gr], c(0.25, 0.5, 0.75)), Inf)
                              })) %>% 
  nest(bp = -gr) %>% 
  mutate(bp = purrr::map(.$bp, unlist))

Then I join it with df:

df <- inner_join(df, df_breakpoints, by = "gr")

My first guess to define the discretized variable lvl was

df %>% mutate(lvl = findInterval(x = val, vec = bp))

It produces the error

Error : Problem with `mutate()` input `lvl2`.
x 'vec' must be sorted non-decreasingly and not contain NAs
ℹ Input `lvl` is `findInterval(x = val, vec = bp)`.

Then I tried

df$lvl <- purrr::imap_dbl(1:nrow(df),
                               ~findInterval(x = df$val[.x], vec = df$bp[[.x]]))

or

df %>% mutate(lvl = purrr::map2_int(df$val, df$bp, findInterval))

It does work. However it is highly unefficient. With my actual data (1.2 million rows) it takes several minutes to run. I guess there is a much better way of doing this than iterating on rows. Any idea?

Augustin
  • 307
  • 2
  • 10
  • Several alternatives: [How to quickly form groups (quartiles, deciles, etc) by ordering column(s) in a data frame](https://stackoverflow.com/questions/4126326/how-to-quickly-form-groups-quartiles-deciles-etc-by-ordering-columns-in-a), including the alternative showed below, and the convenience function `dplyr::ntile`. Just do `group_by` first. – Henrik Aug 04 '21 at 11:16

1 Answers1

2

You can do this in group_by + mutate step -

library(dplyr)

df %>%
  group_by(gr) %>%
  mutate(breakpoints = findInterval(val, 
                       c(-Inf, quantile(val, c(0.25, 0.5, 0.75)), Inf))) %>%
  ungroup

#      gr    val breakpoints
#   <int>  <dbl>       <int>
# 1     1  0.440           1
# 2     1  0.770           2
# 3     1  2.56            4
# 4     1  1.07            3
# 5     1  1.13            3
# 6     1  2.72            4
# 7     1  1.46            4
# 8     1 -0.265           1
# 9     1  0.313           1
#10     1  0.554           2
# … with 20 more rows

findInterval is applied for each gr separately.

Ronak Shah
  • 377,200
  • 20
  • 156
  • 213