
I want to create a new variable with 3 arbitrary categories based on continuous data.

set.seed(123)
df <- data.frame(a = rnorm(100))

Using base R, I would write:

# Use <= on the upper bounds so values exactly equal to 0.5 or 0.6 are not left as NA
df$category[df$a <= 0.5] <- "low"
df$category[df$a > 0.5 & df$a <= 0.6] <- "middle"
df$category[df$a > 0.6] <- "high"

Is there a dplyr solution for this, presumably using mutate()?

Furthermore, is there a way to calculate the categories rather than choosing them? That is, can R work out where the breaks between the categories should be?

EDIT

The answer is in this thread; however, it does not cover labelling, which confused me (and may confuse others), so I think this question serves a purpose.

FilipW
    Try using `cut`. See `?cut`. – aichao Nov 02 '16 at 12:37
    answer is here http://stackoverflow.com/questions/23163567/r-dplyr-categorize-numeric-variable-with-mutate – gfgm Nov 02 '16 at 12:44
    @GabrielFGeislerMesevage sure, I read that one, however, it did not involve the issue of labels that both Robert and aichao mentioned below. For a beginner, like myself, I think that this thread serves a purpose. Correct me if I'm wrong. – FilipW Nov 02 '16 at 14:23
    dplyr provides a neat solution for this through the `case_when()` function. https://dplyr.tidyverse.org/reference/case_when.html – FilipW Apr 27 '18 at 09:08

2 Answers


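To convert a numeric variable to a categorical one, use `cut()`. By default its intervals are closed on the right, so the breaks below give the bins (-Inf, 0.5], (0.5, 0.6] and (0.6, Inf). In your particular case, you want: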

df$category <- cut(df$a, 
                   breaks=c(-Inf, 0.5, 0.6, Inf), 
                   labels=c("low","middle","high"))

Or, using dplyr:

library(dplyr)
res <- df %>% mutate(category=cut(a, breaks=c(-Inf, 0.5, 0.6, Inf), labels=c("low","middle","high")))
##               a category
##1   -0.560475647      low
##2   -0.230177489      low
##3    1.558708314     high
##4    0.070508391      low
##5    0.129287735      low
## ...
##35   0.821581082     high
##36   0.688640254     high
##37   0.553917654   middle
##38  -0.061911711      low
##39  -0.305962664      low
##40  -0.380471001      low
## ...
##96  -0.600259587      low
##97   2.187332993     high
##98   1.532610626     high
##99  -0.235700359      low
##100 -1.026420900      low
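
The `case_when()` function mentioned in the comments on the question can express the same binning with explicit conditions. The following is a minimal sketch rather than part of the original answer; the name `res2` is made up for illustration, and the result is a character column, not a factor as with `cut()`.

```r
library(dplyr)

set.seed(123)
df <- data.frame(a = rnorm(100))

# case_when() checks the conditions in order and uses the first one that is TRUE,
# so each later condition only applies to rows not matched earlier.
res2 <- df %>%
  mutate(category = case_when(
    a <= 0.5 ~ "low",     # (-Inf, 0.5]
    a <= 0.6 ~ "middle",  # (0.5, 0.6]
    TRUE     ~ "high"     # everything else
  ))

table(res2$category)
```

Note that a missing value in `a` would fall through to the final `TRUE` branch and be labelled "high", whereas `cut()` would return `NA` for it.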
helcode
aichao
    Also, if you want the resulting categories to be ordered, set `cut(..., ordered_result = TRUE)`. – HBat Feb 22 '20 at 21:09
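
To illustrate the comment above, here is a small sketch on a made-up three-element vector (`x` and `f` are illustrative names). With `ordered_result = TRUE` the result is an ordered factor, so the levels can be compared with `<` and `>`.

```r
# One value per bin: (-Inf, 0.5], (0.5, 0.6], (0.6, Inf)
x <- c(-1, 0.55, 2)

f <- cut(x,
         breaks = c(-Inf, 0.5, 0.6, Inf),
         labels = c("low", "middle", "high"),
         ordered_result = TRUE)

is.ordered(f)  # checks that f is an ordered factor
f < "high"     # element-wise comparison against a level
```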

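To let R choose the breaks, compute them from quantiles and pass them to `cut()`:

library(dplyr)

# Break points at the minimum, the 1/3 and 2/3 quantiles, and the maximum
xs <- quantile(df$a, c(0, 1/3, 2/3, 1))
# include.lowest = TRUE keeps the minimum value in the first bin
# (alternative: nudge the lowest break down, e.g. xs[1] <- xs[1] - 0.00005)
df1 <- df %>% mutate(category = cut(a, breaks = xs, labels = c("low", "middle", "high"), include.lowest = TRUE))
boxplot(df1$a ~ df1$category, col = 3:5)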

[Figure: boxplot of a by category (low, middle, high)]
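
Another way to let R pick the breaks, sketched here as an alternative to the quantile approach, is to pass a single number as `breaks`; `cut()` then divides the range of the data into that many equal-width intervals. The column name `width_bin` is made up for illustration.

```r
set.seed(123)
df <- data.frame(a = rnorm(100))

# A single number for `breaks` asks cut() for that many equal-width intervals
df$width_bin <- cut(df$a, breaks = 3, labels = c("low", "middle", "high"))

# Equal-width bins generally hold unequal counts,
# whereas quantile-based bins hold roughly equal counts
table(df$width_bin)
```

Which of the two is preferable depends on whether the categories should cover equal ranges of values or contain roughly equal numbers of observations.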

Robert