1

I have data in a numerical format and I would like to label them into three levels - "low", "med", "high" based on their quantile

  1. low, <50% pencentile
  2. med, 50% < 75% pencentile
  3. high, > 75% pencentile

This is my R output

 quantile(data$crim)
       0%       25%       50%       75%      100% 
 0.006320  0.082045  0.256510  3.677083 88.976200 

I use mutate to label them, below is my R code. I got the "high" for all data.

 newdata<-mutate(data, crim.lev = ifelse(crim %in% 0:0.26, "low",
                                           ifelse(crim %in% 0.27:3.68, "med",
                                                   "high")))

Please let me anything I missed, or other method to do this. Thanks.

Ronak Shah
  • 377,200
  • 20
  • 156
  • 213

3 Answers3

4

I suggest cut is a better fit here. It only calculates the quantiles once, does not do row-by-row comparisons (as in the case of case_when), and no nested-ifelse.

quantile(mtcars$disp, c(0, 0.5, 0.75, 1))
#    0%   50%   75%  100% 
#  71.1 196.3 326.0 472.0 

mtcars %>%
  mutate(lev = cut(disp, c(-Inf, quantile(disp, c(0, 0.5, 0.75, 1))[-1]), 
                   label=c("low", "med", "high"))) %>%
  select(disp, lev)
#     disp  lev
# 1  160.0  low
# 2  160.0  low
# 3  108.0  low
# 4  258.0  med
# 5  360.0 high
# 6  225.0  med
# 7  360.0 high
# 8  146.7  low
# 9  140.8  low
# 10 167.6  low
# 11 167.6  low
# 12 275.8  med
# 13 275.8  med
# 14 275.8  med
# 15 472.0 high
# 16 460.0 high
# 17 440.0 high
# 18  78.7  low
# 19  75.7  low
# 20  71.1  low
# 21 120.1  low
# 22 318.0  med
# 23 304.0  med
# 24 350.0 high
# 25 400.0 high
# 26  79.0  low
# 27 120.3  low
# 28  95.1  low
# 29 351.0 high
# 30 145.0  low
# 31 301.0  med
# 32 121.0  low

The use of -Inf is because cut is typically left-open, and its only alternative is right-open (so either the min or the max will be omitted ... so I replace the min with something that should be below the actual values).

r2evans
  • 141,215
  • 6
  • 77
  • 149
2

We can use case_when with quantile function to divide data into groups. Using it with built-in mtcars dataset on mpg column.

library(dplyr)

mtcars %>%
  mutate(group = case_when(mpg < quantile(mpg,0.5) ~ 'low', 
                      between(mpg, quantile(mpg, 0.5), quantile(mpg, 0.75))~'med', 
                      TRUE ~ 'high'))


#    mpg cyl  disp  hp drat    wt  qsec vs am gear carb group
#1  21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4   med
#2  21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4   med
#3  22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1   med
#4  21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1   med
#5  18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2   low
#6  18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1   low
#7  14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4   low
#8  24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2  high
#....
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
0

That's what cut() is designed for. Example:

res <- cut(iris$Sepal.Length, breaks=c(0, quantile(iris$Sepal.Length)), labels=seq(0, 1, .25))
head(res)
# [1] 0.25 0.25 0.25 0.25 0.25 0.5 
# Levels: 0 0.25 0.5 0.75 1
jay.sf
  • 60,139
  • 8
  • 53
  • 110