I'd like to use ntile from the dplyr package to generate a vector of quantiles. A problem occurs when I divide my data into a small number of groups.

For example, if I have a vector of -1 and 1, the value -1 should be in quantile 1 and value 1 should be in quantile 2:

library(dplyr)
index2 <- rep(c(-1,1,-1),each=4) 
#[1] -1 -1 -1 -1  1  1  1  1 -1 -1 -1 -1

However, using ntile, the last two data points end up in the wrong quantile (2 instead of 1):

ntile(index2,2)
# [1] 1 1 1 1 2 2 2 2 1 1 2 2

Here's the result I would expect for the index2 quantiles:

   #  1  1  1  1  2  2  2  2  1  1  1  1

I have the same problem with n=3. The results are not as expected.

index3 <- rep(c(-1,1,-2,-2),each=3)
#[1] -1 -1 -1  1  1  1 -2 -2 -2 -2 -2 -2
ntile(index3,3)
#[1] 2 2 3 3 3 3 1 1 1 1 2 2

Here's the result I would expect for the index3 quantiles:

#  2  2  2  3  3  3  1  1  1  1  1  1

I'm also open to a cut and quantile() solution.

Pierre Lapointe
  • The `ntile()` function splits the data so each group has roughly the same number of values. If you had all 1's and asked for two groups, then half would be assigned to the first group and half to the second. It sounds like maybe you don't really want quantiles? If your values are already discrete, maybe just make them factors? I really don't know what behavior you expect here. – MrFlick Mar 14 '17 at 19:11
  • I'm expecting a cut vs. the median value, not the average value. So the number of elements is not expected to be the same in both quantiles. – Pierre Lapointe Mar 14 '17 at 19:12
  • So what if you have all 1's and you request 3 groups; what would the behavior be? – MrFlick Mar 14 '17 at 19:14
  • That would not happen. In my real data, the lowest number of distinct values would be 2, and the number of quantiles would be 2 in that case. – Pierre Lapointe Mar 14 '17 at 19:17
  • Well, it makes a difference in how one would program it. I don't know what assumptions you are willing to make from your examples. Perhaps I'm still distracted by the misuse of the term quantile. Perhaps you could be more explicit in the question itself. – MrFlick Mar 14 '17 at 19:19
  • Perhaps [this post about splitting a variable into groups](http://stackoverflow.com/questions/6104836/splitting-a-continuous-variable-into-equal-sized-groups/7965876#7965876) is relevant; the `cut_number` function in the __`ggplot2`__ package may be helpful. – bouncyball Mar 14 '17 at 19:19
  • @MrFlick I added the desired results for index2 and index3 in the question. To be clear, I'm not expecting equal buckets. – Pierre Lapointe Mar 14 '17 at 19:24
  • Following MrFlick's first comment regarding factor conversion, you could do `factor(index3, labels=seq_along(unique(index3)))` or `as.integer(factor(index3, labels=seq_along(unique(index3))))` if you wanted integers. – lmo Mar 14 '17 at 19:30
  • @lmo Good idea but it wouldn't work with my real world data which sometimes has lots of uniques, sometimes not. This specific question is for the case of few uniques. – Pierre Lapointe Mar 14 '17 at 19:42
  • @bouncyball Unfortunately `table(cut_number(index2, 2))` does not work. I get: `Error: Insufficient data values to produce 2 bins.` – Pierre Lapointe Mar 14 '17 at 19:43
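
The equal-size splitting described in the first comment can be illustrated with a rough base R sketch. `ntile_sketch` is a hypothetical helper written for this illustration, not dplyr's actual implementation; it assumes the vector length is divisible by `n` and that ties are broken by position. Under those assumptions it reproduces the outputs reported in the question, which shows why tied values can be split across two groups.

```r
# Hypothetical helper: rank the values (ties broken by order of appearance),
# then slice the ranks into n equally sized buckets.
ntile_sketch <- function(x, n) {
  r <- rank(x, ties.method = "first")  # integer ranks 1..length(x)
  as.integer(ceiling(r * n / length(x)))
}

index2 <- rep(c(-1, 1, -1), each = 4)
ntile_sketch(index2, 2)
# [1] 1 1 1 1 2 2 2 2 1 1 2 2

index3 <- rep(c(-1, 1, -2, -2), each = 3)
ntile_sketch(index3, 3)
# [1] 2 2 3 3 3 3 1 1 1 1 2 2
```

Because each bucket must hold the same number of elements, the eight -1 values in `index2` cannot all fit into a bucket of size six, so the last two spill into bucket 2.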

1 Answer

How about this function?

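quant_cut <- function(x, n) {
    # interior cut points at probabilities 1/n, ..., (n-1)/n
    qs <- quantile(x, 1:(n-1)/n)
    # open-ended outer breaks so every value falls into some bin
    brks <- c(-Inf, qs, Inf)
    # labels=FALSE returns integer bin codes instead of a factor
    cut(x, breaks=brks, labels=FALSE)
}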

We calculate the quantile values, then use cut to break at those values, so the resulting groups may have unequal sizes. For example:

index2 <- rep(c(-1,1,-1),each=4) 
quant_cut(index2, 2)
#  [1] 1 1 1 1 2 2 2 2 1 1 1 1

and

index3 <- rep(c(-1,1,-2,-2),each=3)
quant_cut(index3,3)
# [1] 2 2 2 3 3 3 1 1 1 1 1 1
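
One case raised in the comments is data where several quantiles coincide (for example, all 1's with three groups). In that situation the break points passed to cut are duplicated, and cut stops with an error because the breaks are not unique. A possible variant, sketched here under the assumption that collapsing duplicate break points (and therefore returning fewer than `n` groups) is acceptable, is shown below; `quant_cut_unique` is a hypothetical name, not part of any package.

```r
# Hypothetical variant of quant_cut: drop duplicated quantile values
# before cutting, so fewer than n groups may be returned.
quant_cut_unique <- function(x, n) {
  qs <- quantile(x, probs = seq_len(n - 1) / n, names = FALSE)
  brks <- c(-Inf, unique(qs), Inf)
  cut(x, breaks = brks, labels = FALSE)
}

# All values identical: both tertile cut points equal 1, so one group remains.
quant_cut_unique(rep(1, 6), 3)
# [1] 1 1 1 1 1 1

# Same result as quant_cut on the question's data.
quant_cut_unique(rep(c(-1, 1, -1), each = 4), 2)
# [1] 1 1 1 1 2 2 2 2 1 1 1 1
```

Whether merging groups is the right behaviour depends on the real data; an alternative would be to raise an informative error instead.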
MrFlick