Better way for generating groups based on many rules in R

Question

I have a dataset with many columns that, by each row value combination, determine a set of rules for a new value in another column. The different combinations are diverse, and not all columns are included for each rule. Also, some columns have organism names that tend to be quite long. Due to this, the current method I am using (case_when) becomes quite messy, and reviewing these rules becomes quite tedious.

I am wondering if there is a better way of doing this that is cleaner and easier to review? The dataset I run this on has over 70.000 observations, so below is a dummy dataset that can be used.

col1   col2   col3   col4     col5  col6
1      A      43     string1  AA    verylongnamehere
2      B      22     string2  BB    anotherlongname
3      C      15     string3  CC    yetanotherlongname
4      D      100    string4  DD    hereisanotherlongname
5      E      60     string5  EE    thisisthelastlongname


test <- data.frame(
  col1 = c(1,2,3,4,5),
  col2 = c("A","B","C","D","E"),
  col3 = c(43,22,15,100,60),
  col4 = c("string1","string2","string3","string4","string5"),
  col5 = c("AA","BB","CC","DD","EE"),
  col6 = c("verylongnamehere", "anotherlongname","yetanotherlongname","hereisanotherlongname","thisisthelastlongname")
)

The following code is an example of the rules and code I use:

library(dplyr)

test2 <- test %>%
  mutate(new_col = case_when(
    col1 == 1 & col2 == "A" & col6 == "verylongnamehere" ~ "result1",
    col3 >= 60 & col5 == "DD" ~ "result2",
    col1 %in% c(2,3,4) & 
     col2 %in% c("B","D") & 
     col5 %in% c("BB","CC","DD") & 
     col6 %in% c("anotherlongname","yetanotherlongname") ~ "result3",
    TRUE ~ "result4"
  ))

This is often tricky and solutions depend on circumstances. If a handful of conditions I would try to devise a meaningful name for each condition, create a column of `TRUE/NA` based on that name and then `coalesce()` a new column. If a lot of conditions I might prefer to put the data in long form. A few questions: How big is the real data? How many conditions are there? Do you expect to have to change regularly the rules or number of conditions? Can a row meet two conditions (they don't seem mutually exclusive) and if so what is intended? Do you need to use `dplyr` rather than `data.table`? — SamR, Feb 07 '23 at 08:47
`case_when` might be your best choice here, if there are lots of conditions, like your example, it is generally as irreducible as what you show — Maël, Feb 07 '23 at 09:35

score 1 · Accepted Answer · answered Feb 07 '23 at 09:58

It might be easier to review the conditions if they were in a spreadsheet. Here is how you could read them from it and build your case_when.

Spreadsheet representation (conditions.xlsx): Note that == and %in% are considered as default and are not explicitly included here.

Load the conditions

library(readxl)
cond <- read_excel('conditions.xlsx')

dput(cond):

structure(list(Result = c("result1", "result2", "result3", "result4"
), col1 = c("1", NA, "c(2, 3, 4)", NA), col2 = c("\"A\"", NA, 
"c(\"B\",\"D\")", NA), col3 = c(NA, ">= 60", NA, NA), col4 = c(NA, 
NA, NA, NA), col5 = c(NA, "\"DD\"", "c(\"BB\",\"CC\",\"DD\")", 
NA), col6 = c("\"verylongnamehere\"", NA, "c(\"anotherlongname\",\"yetanotherlongname\")", 
NA)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, 
-4L))

Process the conditions into a case_when command:

# separate conditions and results
results <- cond$Result
cond <- trimws(as.matrix(cond[, -1]))

# add default %in% operator for vectors
add.in <- grepl('^c\\(', cond)
cond[add.in] <- paste('%in%', cond[add.in])
# add default ==
add.equals <- grepl('^[^<>%!]', cond)
cond[add.equals] <- paste('==', cond[add.equals])

# add column names to conditions and join them together with ' & '
col.cond <- apply(cond, 1, \(x) {
  col.cond <- paste(colnames(cond), x)[!is.na(x)]
  paste(col.cond, collapse=' & ')
})
# put TRUE where no condition was given (default value)
col.cond[col.cond==''] <- 'TRUE'

# add results and join all together
case.when <- paste0(col.cond, ' ~ "', results, '"', collapse=',\n ')
# complete the case_when()
case.when <- paste('case_when(\n',
               case.when,
               '\n)')

case.when is your case_when command as a string:

cat(case.when)
# case_when(
#  col1 == 1 & col2 == "A" & col6 == "verylongnamehere" ~ "result1",
#  col3 >= 60 & col5 == "DD" ~ "result2",
#  col1 %in% c(2, 3, 4) & col2 %in% c("B","D") & col5 %in% c("BB","CC","DD") & col6 %in% c("anotherlongname","yetanotherlongname") ~ "result3",
#  TRUE ~ "result4" 
# )

And now we just parse it, evaluate and use inside mutate:

test2 <- test %>% 
  mutate(new_col = eval(parse(text=case.when)))

#   col1 col2 col3    col4 col5                  col6 new_col
# 1    1    A   43 string1   AA      verylongnamehere result1
# 2    2    B   22 string2   BB       anotherlongname result3
# 3    3    C   15 string3   CC    yetanotherlongname result4
# 4    4    D  100 string4   DD hereisanotherlongname result2
# 5    5    E   60 string5   EE thisisthelastlongname result4

Based on your example, I only considered conditions that use & as a logical operator. If | would be used as well, another column would have to be added in the spreadsheet for each data column specifying the logical operator (& or |) used for that condition. In the case of more complex conditions with parentheses, this approach would probably not be possible.

This is a good start - though personally I'd prefer a plain text set of rules. I am a bit hesitant about feeding the rules through several `gsub()` calls and `eval(parse())`. It feels to me that it would be quite easy for a bug to slip in unnoticed. I think it could be improved using tidy evaluation, and specifically with splicing and `rlang::parse_exprs()`, e.g. as [here](https://stackoverflow.com/questions/61789717/dplyr-case-when-with-dynamic-number-of-cases). — SamR, Feb 07 '23 at 13:50
@SamR There is definitely a risk of introducing bugs here. Hopefully, most bugs would lead to an error during evaluation and those that wouldn't would likely go unnoticed in any kind of representation. However, the person filling the spreadsheet has to be familiar with R and ideally the coder themself. The main point of this approach is to have the easily readable/editable *result* × *column* matrix of conditions which I can hardly imagine in a plain text format. — Robert Hacken, Feb 07 '23 at 14:36
When I say plain text I don't necessarily mean a text file. For me at least I think a yaml file of conditions would be the easiest to make sense of but people's brains work in different ways, and implementing that would have disadvantages as well. What I mean is, I don't see why the excel file couldn't be a csv. I do think that separating the rules from the code is a nice approach though. — SamR, Feb 07 '23 at 14:43

Better way for generating groups based on many rules in R

1 Answers1