24

When doing data analysis, I sometimes need to recode values to factors in order to carry out group analyses. I want the factor levels to keep the same order as the conditions specified in case_when. In this case, the order should be "Excellent" "Good" "Fail". How can I achieve this without tediously repeating the labels, as in levels=c('Excellent', 'Good', 'Fail')?

Thank you very much.


library(dplyr, warn.conflicts = FALSE)             
                                                   
set.seed(1234)                                     
score <- runif(100, min = 0, max = 100)     
   
Performance <- function(x) {                       
  case_when(                                         
    is.na(x) ~ NA_character_,                          
    x > 80   ~ 'Excellent',                            
    x > 50   ~ 'Good',                                 
    TRUE     ~ 'Fail'                                  
  ) %>% factor(levels=c('Excellent', 'Good', 'Fail'))
}                                                  
                                                   
performance <- Performance(score)                  
levels(performance)                                
#> [1] "Excellent" "Good"      "Fail"
table(performance)                                 
#> performance
#> Excellent      Good      Fail 
#>        15        30        55
  • that's what he doesn't want to do (and is already doing) – De Novo Mar 30 '18 at 10:17
  • That's a nice solution! – Luke Hayden Mar 31 '18 at 17:48
  • Beautiful, thank you for this! – jzadra Feb 13 '20 at 21:30
  • To allow for expressions on the RHS, insert `levels = sapply(levels, FUN = eval)` on the second-to-last line. This makes it possible to write `result = fct_case_when(x < 5 ~ my_vec[3])` without getting "my_vec[3]" as `result`. – Jonas Lindeløv Sep 18 '20 at 08:45
  • Please do not edit solution announcements into the question. Accept (i.e. click the "tick" next to it) one of the existing answer, if there are any. You can also create your own answer, and even accept it, if your solution is not yet covered by an existing answer. Compare https://stackoverflow.com/help/self-answer – Yunnosch Sep 24 '21 at 13:45

5 Answers

10

My Solution

I eventually came up with a solution. I wrote a function fct_case_when (named as if it were a function in forcats). It is just a wrapper around case_when that returns a factor. The levels follow the order of the arguments.


fct_case_when <- function(...) {
  args <- as.list(match.call())
  levels <- sapply(args[-1], function(f) f[[3]])  # extract RHS of formula
  levels <- levels[!is.na(levels)]
  factor(dplyr::case_when(...), levels=levels)
}

Now, I can use fct_case_when in place of case_when, and the result will be the same as the previous implementation (but less tedious).


Performance <- function(x) {                       
  fct_case_when(                                         
    is.na(x) ~ NA_character_,                          
    x > 80   ~ 'Excellent',                            
    x > 50   ~ 'Good',                                 
    TRUE     ~ 'Fail'                                  
  )
}      
performance <- Performance(score)                  
levels(performance)                       
#> [1] "Excellent" "Good"      "Fail"
table(performance)                
#> performance
#> Excellent      Good      Fail 
#>        15        30        55
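
The wrapper above reads each right-hand side literally from the call, so it only suits string literals. A minimal sketch of a variant that evaluates each right-hand side instead, along the lines suggested in the comments, could look like this. The name `fct_case_when_eval` is hypothetical, the sketch assumes dplyr is installed, and it assumes every right-hand side evaluates to a single string.

```r
# Hypothetical variant of fct_case_when(): evaluates each right-hand side
# instead of taking it literally from the call.
fct_case_when_eval <- function(...) {
  formulas <- list(...)
  # Evaluate every RHS in the environment where its formula was created
  rhs <- lapply(formulas, function(f) eval(f[[3]], environment(f)))
  levels <- unique(unlist(rhs))
  levels <- levels[!is.na(levels)]   # NA is not a level
  factor(dplyr::case_when(...), levels = levels)
}

labels <- c("Excellent", "Good", "Fail")
score <- c(95, 60, 20, NA)

result <- fct_case_when_eval(
  is.na(score) ~ NA_character_,
  score > 80   ~ labels[1],
  score > 50   ~ labels[2],
  TRUE         ~ labels[3]
)
levels(result)
#> [1] "Excellent" "Good"      "Fail"
```

Because the levels come from the evaluated values, a label that two conditions share appears only once in the level set.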
4

Levels are set in lexicographic order by default. If you don't want to specify them, you can either set the labels up so that lexicographic order is the order you want (Performance1), or create a levels vector once and use it both when generating the values and when setting the levels (Performance2). I don't know how much effort or tedium either of these would save you, but here they are. See my third suggestion for what I think is the least tedious way.

Performance1 <- function(x) {                       
  case_when(
    is.na(x) ~ NA_character_,                          
    x > 80 ~ 'Excellent',  
    x <= 50 ~ 'Fail',
    TRUE ~ 'Good',
  ) %>% factor()
}

Performance2 <- function(x, levels = c("Excellent", "Good", "Fail")){
  case_when(
    is.na(x) ~ NA_character_,
    x > 80 ~ levels[1],
    x > 50 ~ levels[2],
    TRUE ~ levels[3]
  ) %>% factor(levels)
}
performance1 <- Performance1(score)
levels(performance1)
# [1] "Excellent" "Fail"     "Good"
table(performance1)
# performance1
# Excellent      Fail      Good 
#        15        55        30 

performance2 <- Performance2(score)
levels(performance2)
# [1] "Excellent" "Good"      "Fail"  
table(performance2)
# performance2
# Excellent      Good      Fail 
#        15        30        55 

If I could suggest an even less tedious way:

performance <- cut(score, breaks = c(0, 50, 80, 100), 
                   labels = c("Fail", "Good", "Excellent"))
levels(performance)
# [1] "Fail"      "Good"      "Excellent"
table(performance)
# performance
#      Fail      Good Excellent 
#        55        30        15
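
If the "Excellent" "Good" "Fail" order is needed from cut, one possible sketch in base R is to reverse the levels afterwards. The sample scores below are made up for illustration. Note that cut uses right-closed intervals by default, so a score of exactly 0 falls outside (0, 50] unless include.lowest = TRUE is passed.

```r
# Made-up scores that include the boundary values
score <- c(0, 50, 50.5, 80, 95, 100, NA)

performance <- cut(
  score,
  breaks = c(0, 50, 80, 100),
  labels = c("Fail", "Good", "Excellent"),
  include.lowest = TRUE   # keep a score of exactly 0 in the first bin
)

# Reverse the level order without changing the values
performance <- factor(performance, levels = rev(levels(performance)))

levels(performance)
#> [1] "Excellent" "Good"      "Fail"
as.character(performance)
#> [1] "Fail"      "Fail"      "Good"      "Good"      "Excellent" "Excellent"
#> [7] NA
```

With these breaks the boundaries behave like the case_when version: 50 is a "Fail" and 80 is a "Good".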
De Novo
  • I think `Performance2` is close to what I need. Is there any function in `dplyr` or `forcats` that can do this in one step? That is, without saving the levels first. Also, the `cut` function is handy for conversion of numerical values to factors, although it reverses the order in this case (can be easily corrected using `forcats::fct_rev`). Thanks. –  Mar 30 '18 at 11:04
  • I think the disadvantage of `Performance2` is that we can't immediately see the corresponding conversion. For example, when seeing `x > 80 ~ levels[1]`, we have to look for the `levels` vector and see what its first element is in order to find out that `x > 80` corresponds to `Excellent`. So it is handy for programming but in my opinion reduces readability. It would be great if someone could provide a solution that is programming-friendly and readable at the same time. –  Mar 30 '18 at 11:23
1

While my solution replaces your piping with a messy intermediate variable, this works:

library(dplyr, warn.conflicts = FALSE)

set.seed(1234)                                     
score <- runif(100, min = 0, max = 100)     

Performance <- function(x) {                       
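  # labels assigned by case_when, in input order
  lab <- case_when(
    is.na(x) ~ NA_character_,
    x > 80   ~ 'Excellent',
    x > 50   ~ 'Good',
    TRUE     ~ 'Fail'
  )
  # TRUE at the first appearance of each distinct label
  first <- !duplicated(lab)
  # order the distinct labels by the score at their first appearance, highest first
  factor(lab, levels = lab[first][order(x[first], decreasing = TRUE)])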
}                                                  
performance <- Performance(score)                
levels(performance)  

Edited to fix!

Luke Hayden
  • This doesn't work. It produces the error `factor level [2] is duplicated`. –  Mar 30 '18 at 10:28
  • This works. But seems complicated and does not save typing much. Anyway thanks! –  Mar 30 '18 at 10:57
  • I found that this doesn't work generally. For example, when score is `rbinom(10, size = 9, prob = .5)` and the conditions changed to `x %% 2 == 1 ~ 'Odd', x %% 2 == 0 ~ 'Even'`, sometimes the order of levels is `Odd Even`, but sometimes is `Even Odd`, which is not always the same order specified in `case_when`. You are using `order` so I guess this method works only when the values have a reasonable order. –  Mar 30 '18 at 12:05
  • Hmmm. I think that a better way to go about this might be to create a list containing two vectors, one with the ordered thresholds, the other with the factors describing the conditions, then providing this list as an argument to the function. This would allow you to make the function fully generalisable, if that is what you're after. – Luke Hayden Mar 30 '18 at 12:16
1

This is an implementation I have been using:

library(dplyr)
library(purrr)
library(rlang)
library(forcats)

factored_case_when <- function(...) {
  args <- list2(...)
  rhs <- map(args, f_rhs)
  
  cases <- case_when(
    !!!args
  )
  
  exec(fct_relevel, cases, !!!rhs)
}


numbers <- c(2, 7, 4, 3, 8, 9, 3, 5, 2, 7, 5, 4, 1, 9, 8)

factored_case_when(
  numbers <= 2 ~ "Very small",
  numbers <= 3 ~ "Small",
  numbers <= 6 ~ "Medium",
  numbers <= 8 ~ "Large",
  TRUE    ~ "Huge!"
)
#>  [1] Very small Large      Medium     Small      Large      Huge!     
#>  [7] Small      Medium     Very small Large      Medium     Medium    
#> [13] Very small Huge!      Large     
#> Levels: Very small Small Medium Large Huge!

This has the advantage of not having to manually specify the factor levels.

I have also submitted a feature request to dplyr for this functionality: https://github.com/tidyverse/dplyr/issues/6029

snakeoilsales
0

Let case_when() output numbers and use the labels argument in factor():

library(dplyr, warn.conflicts = FALSE)
set.seed(1234)
score <- runif(100, min = 0, max = 100)

Performance <- function(x) {
  case_when(
    is.na(x) ~ NA_real_,
    x > 80   ~ 1,
    x > 50   ~ 2,
    TRUE     ~ 3
  ) %>% factor(labels=c('Excellent', 'Good', 'Fail'))
}

performance <- Performance(score)
levels(performance)
#> [1] "Excellent" "Good"      "Fail"
table(performance)
#> performance
#> Excellent      Good      Fail 
#>        15        30        55

Created on 2023-01-13 with reprex v2.0.2
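
One caveat with this approach: when levels is not given, factor() derives the levels from the values that actually occur, so labels lines up only if every code is present in the data. A minimal sketch of a more defensive call in base R, using a made-up codes vector in which category 2 never occurs, passes levels = 1:3 explicitly:

```r
labels <- c("Excellent", "Good", "Fail")
codes <- c(1, 3, 3, NA)   # made-up codes: no observation falls in category 2

# Without explicit levels, factor() finds only two levels (1 and 3),
# so three labels cannot be matched and the call fails
fails <- inherits(try(factor(codes, labels = labels), silent = TRUE), "try-error")
fails
#> [1] TRUE

# With explicit levels, each code keeps its intended label
safe <- factor(codes, levels = 1:3, labels = labels)
levels(safe)
#> [1] "Excellent" "Good"      "Fail"
as.character(safe)
#> [1] "Excellent" "Fail"      "Fail"      NA
```

The same levels = 1:3 argument can be added to the factor() call in Performance above.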

its.me.adam