dplyr create a new column with a complex user defined function of other columns

Question

I have a large data frame with responses to 40 questions (reprex with 3 questions below) and need to compute a new column that is a complex function of these 40 responses. As it is virtually impossible to write out the function inside mutate, I tried creating a function f that could be called from mutate:

df <- data.frame(Sex = c(rep("F", 5), rep("M", 5)),
                 Q1  = sample(0:10, 10, replace = TRUE),
                 Q2  = sample(0:10, 10, replace = TRUE),
                 Q3  = sample(0:10, 10, replace = TRUE)
)

f <- function(q1, q2, q3){
  y <- q1 + (q2^2) - (q3^3)
  return(y)
}

Creating a new column with mutate now works fine:

df %>%
   mutate(newcol = f(Q1, Q2, Q3))

  Sex Q1 Q2 Q3 newcol
1    F 10  6  3     19
2    F  0  9  9   -648
3    F  8  1  2      1
4    F  0  4  7   -327
5    F  6  4  1     21
6    M  8  3  3    -10
7    M  2  2  0      6
8    M 10  0  3    -17
9    M  6  9  3     60
10   M  1  7  2     42

as does

 df$newcol <- mapply(f, df$Q1, df$Q2, df$Q3)

But if I include even a simple if statement in f, as follows:

f <- function(q1, q2, q3){
  y <- q1 + (q2^2) - (q3^3)
  if(y<0){
    y <- -y
  }
  return(y)
}

I immediately have a disaster on my hands:

df %>%
  mutate(newcol = f(Q1, Q2, Q3))
   Sex Q1 Q2 Q3 newcol
1    F 10  6  3     19
2    F  0  9  9   -648
3    F  8  1  2      1
4    F  0  4  7   -327
5    F  6  4  1     21
6    M  8  3  3    -10
7    M  2  2  0      6
8    M 10  0  3    -17
9    M  6  9  3     60
10   M  1  7  2     42
Warning message:
Problem with `mutate()` input `newcol`.
i the condition has length > 1 and only the first element will be used
i Input `newcol` is `f(Q1, Q2, Q3)`.

However,

df$newcol <- mapply(f, df$Q1, df$Q2, df$Q3)
df
   Sex Q1 Q2 Q3 newcol
1    F 10  6  3     19
2    F  0  9  9    648
3    F  8  1  2      1
4    F  0  4  7    327
5    F  6  4  1     21
6    M  8  3  3     10
7    M  2  2  0      6
8    M 10  0  3     17
9    M  6  9  3     60
10   M  1  7  2     42

continues to work. Unfortunately, there are lots of if's in my function, and with 40 different arguments to pass to the function, the input to mapply becomes enormous. How can I pass my questions to mapply using a predefined vector, say something like

questions <- c("df$Q1", "df$Q2", "df$Q3") 
df$newcol <- mapply(f, questions)

Closely related: How do I define a function with 40 arguments without it running off the page?

It's entirely possible that I am barking up the wrong tree, and if so, how ought I to go about solving my problem?

Many thanks in advance

Thomas Philips

P.S. Here is the real criterion

if(!is.na(df[i, "Q1_Daily_Mean"]) & df[i, "Q1_Daily_Mean"] >= THRESHOLD_MDD_GAD){
  anxiety <- TRUE
}

if(!is.na(df[i, "Q2_Daily_Mean"]) & df[i, "Q2_Daily_Mean"] >= THRESHOLD_MDD_GAD){
  worry <- TRUE
}

if(anxiety && worry){
  anxiety_and_worry <- TRUE
}

if(!is.na(df[i, "Q3_Daily_Mean"]) & df[i, "Q3_Daily_Mean"] >= THRESHOLD_MDD_GAD ){
  agitation <- TRUE
}

if(!is.na(df[i, "Q10_Daily_Mean"]) & df[i, "Q10_Daily_Mean"] >= THRESHOLD_MDD_GAD ){
  anger <- TRUE
}

if(!is.na(df[i, "Q2_Weekly"]) & df[i, "Q2_Weekly"] >= THRESHOLD_MDD_GAD ){
  physical_fatigue <- TRUE
}

if(!is.na(df[i, "Q5_Weekly"]) & df[i, "Q5_Weekly"] >= THRESHOLD_MDD_GAD ){
  no_concentration <- TRUE
}

if(!is.na(df[i, "Q7_Weekly"]) & df[i, "Q7_Weekly"] >= THRESHOLD_MDD_GAD ){
  disturbed_sleep <- TRUE
}

if(!is.na(df[i, "Q13_Weekly"]) & !is.na(df[i, "Q14_Weekly"]) &
   !is.na(df[i, "Q15_Weekly"]) & !is.na(df[i, "Q16_Weekly"]) & 
   !is.na(df[i, "Q17_Weekly"]) & 
   max( df[i, "Q13_Weekly"], df[i, "Q14_Weekly"],
        df[i, "Q15_Weekly"], df[i, "Q16_Weekly"],
        df[i, "Q17_Weekly"] ) >= THRESHOLD_MDD_GAD){
  max_function  <- TRUE
}

sum_of_symptoms_7 <- anxiety + worry + agitation + anger + 
                     physical_fatigue + no_concentration + disturbed_sleep

if (anxiety_and_worry && (sum_of_symptoms_7 >= CRITERIA_NEEDED_GAD) && max_function){
  # Generalized Anxiety Disorder
  df[i, GAD_DESCRIPTPR_EML] <- TRUE
}

If the primary concern is the number of arguments (40 is indeed excessive!) consider [tidying](https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html) your data: have two columns, one for question number and one for response, rather than one for each question. alternatively, you could pass your conditions as a named list, the names of the list identifying the new column names and the values of the list giving the expression to evaluate to populate the new columns — Limey, Feb 28 '21 at 11:36
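A minimal sketch of the tidying idea from the comment above. It assumes an `id` column identifying each respondent (not present in the original reprex) and a purely hypothetical threshold of 5; with one row per respondent and question, a per-respondent score becomes a grouped summary instead of a 40-argument function.

```r
library(dplyr)
library(tidyr)

# Toy data; `id` is an assumed respondent identifier, not part of the question's reprex.
df <- data.frame(id  = 1:3,
                 Sex = c("F", "F", "M"),
                 Q1  = c(10, 0, 8),
                 Q2  = c(6, 9, 1),
                 Q3  = c(3, 9, 2))

# One row per (respondent, question) instead of one column per question.
long <- df %>%
  pivot_longer(cols = starts_with("Q"),
               names_to = "question",
               values_to = "response")

# Example summary: how many responses per respondent reach a hypothetical threshold of 5.
counts <- long %>%
  group_by(id) %>%
  summarise(n_high = sum(response >= 5, na.rm = TRUE), .groups = "drop")
```

Whether the long layout suits the real criterion depends on how much the rules differ from question to question; it helps most when many questions share the same rule.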

score 1 · Answer 1 · answered Feb 28 '21 at 11:12

The function with if statements is not vectorised: if tests a single value, whereas mutate hands the function whole columns. You have two options.

First, make the function vectorised (with ifelse or another vectorised construct) and keep calling it from mutate as you did earlier.

library(dplyr)
library(purrr)

df %>% mutate(newcol = f(Q1, Q2, Q3))

Second, if the conditions are too complex to vectorise, use rowwise or pmap, which operate on one row at a time. This is similar to your mapply attempt, which works.

df %>% mutate(newcol = pmap_dbl(list(Q1, Q2, Q3), ~f(..1, ..2, ..3)))
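The question also asks how to avoid spelling out every column in the call. One possible sketch, assuming the column names are stored as plain strings (without the `df$` prefix) in a vector: select those columns, then hand them to mapply or pmap as a list. The helper `f_row` is hypothetical and shows an alternative in which the function takes one named vector per row rather than 40 separate arguments.

```r
library(purrr)

df <- data.frame(Sex = c("F", "M"), Q1 = c(10, 0), Q2 = c(6, 9), Q3 = c(3, 9))

# Scalar function with an if statement, as in the question.
f <- function(q1, q2, q3) {
  y <- q1 + q2^2 - q3^3
  if (y < 0) y <- -y
  y
}

questions <- c("Q1", "Q2", "Q3")        # column names only
cols <- unname(as.list(df[questions]))  # unnamed list, so arguments match by position

# Base R: build the mapply call from the list of columns.
via_mapply <- do.call(mapply, c(list(FUN = f), cols))

# purrr: pmap_dbl() iterates over the columns in parallel, one row at a time.
via_pmap <- pmap_dbl(cols, f)

# Hypothetical alternative: one named vector per row instead of many arguments.
f_row <- function(q) {
  y <- q[["Q1"]] + q[["Q2"]]^2 - q[["Q3"]]^3
  if (y < 0) y <- -y
  y
}
via_apply <- apply(df[questions], 1, f_row)
```

All three approaches call the function once per row, so they will be slower than a vectorised version on large data, but they leave the if logic untouched.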

dario · Answer 2 · 2021-02-28T10:48:58.307

The reason you get the "the condition has length > 1 and only the first element will be used" warning is the use of if with a vector (for example, see here). dplyr's mutate passes the whole column vector to the called function, not one row element at a time. if only looks at the first element of its condition, so the branch is chosen once, from the first row, for the entire column.
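A small illustration of the difference, using made-up values. The tryCatch wrapper is only there so the snippet runs on any R version: older versions warn on a condition of length greater than one, while R 4.2.0 and later signal an error.

```r
y <- c(19, -648, 1)

# `if` expects a single TRUE/FALSE; a length-3 condition triggers a warning
# (older R) or an error (R >= 4.2.0). Both are caught here.
res <- tryCatch(
  if (y < 0) "negative" else "non-negative",
  warning = function(w) "warning",
  error   = function(e) "error"
)

# ifelse() tests every element and picks the matching value for each one.
flipped <- ifelse(y < 0, -y, y)

# For this particular case, abs() is the built-in vectorised equivalent.
same <- abs(y)
```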

This solves your problem:

library(dplyr)

df <- data.frame(Sex = c(rep("F", 5), rep("M", 5)),
                 Q1  = sample(0:10, 10, replace = TRUE),
                 Q2  = sample(0:10, 10, replace = TRUE),
                 Q3  = sample(0:10, 10, replace = TRUE)
)

f <- function(q1, q2, q3){
  y <- q1 + (q2^2) - (q3^3)
  # ifelse() tests every element of y, unlike if, which expects a single value
  y <- ifelse(y < 0, -y, y)
  return(y)
}

df %>%
  mutate(newcol = f(Q1, Q2, Q3))

Returns:

   Sex Q1 Q2 Q3 newcol
1    F  8  6  3     17
2    F  6  0  0      6
3    F  4  5  7    314
4    F  9  5  7    309
5    F  3  5  9    701
6    M  1 10  5     24
7    M 10  5  4     29
8    M  4  0  3     23
9    M  8  4  7    319
10   M  3  6  3     12
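The same idea can extend to several conditions and Boolean combinations of them. The sketch below is a simplified version of the criterion in the question's P.S., not a full implementation: it uses only a few of the columns, the threshold values are invented, and `meets` is a hypothetical helper. Each if block becomes a logical column, `&&` becomes the element-wise `&`, and `max` becomes the element-wise `pmax`.

```r
library(dplyr)

# Invented thresholds; the real values are not given in the question.
THRESHOLD_MDD_GAD   <- 3
CRITERIA_NEEDED_GAD <- 3

# Toy data using a subset of the column names from the question.
toy <- data.frame(
  Q1_Daily_Mean = c(4, 1, NA, 5),
  Q2_Daily_Mean = c(3, 4, 5, 5),
  Q3_Daily_Mean = c(3, 0, 4, NA),
  Q13_Weekly    = c(1, 4, 4, 2),
  Q14_Weekly    = c(5, 4, NA, 1)
)

# Hypothetical helper: TRUE where a response is present and at or above the
# threshold. A missing response counts as FALSE.
meets <- function(x, threshold = THRESHOLD_MDD_GAD) !is.na(x) & x >= threshold

toy <- toy %>%
  mutate(
    anxiety      = meets(Q1_Daily_Mean),
    worry        = meets(Q2_Daily_Mean),
    agitation    = meets(Q3_Daily_Mean),
    # pmax() returns NA if any input is NA, which meets() then maps to FALSE,
    # mirroring the "all responses present" check in the original code.
    max_function = meets(pmax(Q13_Weekly, Q14_Weekly)),
    n_symptoms   = anxiety + worry + agitation,
    GAD          = anxiety & worry & n_symptoms >= CRITERIA_NEEDED_GAD & max_function
  )
```

Because every step works on whole columns, no row index `i` or loop is needed, and the intermediate flags remain in the data frame for inspection.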

In this very simple case it certainly does - thank you. But the general case has many different conditions as well as Boolean combinations of them. How can I solve the problem in general? — Thomas Philips, Feb 28 '21 at 10:48
Can you extend your question with a MRE that fits the problem you are experiencing? — dario, Feb 28 '21 at 10:50

score 0 · Answer 3 · answered Feb 28 '21 at 11:50

To expand on my comment above:

f <- function(data, conditions) {
  columnNames <- names(conditions)
  for (colName in columnNames) {
    qName <- enquo(colName)
    data <- data %>% mutate(!!qName := eval(conditions[[colName]]))
  }
  data
}

df %>% f(list(bigQ1=expression(Q1 > 7), smallQ2=expression(Q2 < 2)))

gives, for example,

   Sex Q1 Q2 Q3 bigQ1 smallQ2
1    F  2  9  9 FALSE   FALSE
2    F  2 10  6 FALSE   FALSE
3    F  9  4  9  TRUE   FALSE
4    F  1  2  8 FALSE   FALSE
5    F  5 10  2 FALSE   FALSE
6    M 10  8  3  TRUE   FALSE
7    M  4  8  0 FALSE   FALSE
8    M  3  8 10 FALSE   FALSE
9    M  5  2  6 FALSE   FALSE
10   M  8  7  4  TRUE   FALSE

Passing the df as the first parameter of the function allows for piping.
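As a possible variation on the same idea (not the code above, but a sketch using rlang's `exprs()` and the `!!!` splice operator), the named conditions can be captured as unevaluated expressions and spliced into a single mutate call, which avoids the explicit loop and eval. The function name `add_flags` is made up for this example.

```r
library(dplyr)
library(rlang)

df <- data.frame(Q1 = c(2, 9, 10), Q2 = c(9, 1, 8))

# Named, unevaluated expressions: names become column names.
conditions <- exprs(bigQ1 = Q1 > 7, smallQ2 = Q2 < 2)

# Hypothetical helper: splice all conditions into one mutate() call.
add_flags <- function(data, conditions) {
  mutate(data, !!!conditions)
}

out <- df %>% add_flags(conditions)
```

The list of conditions can be built once and reused across data frames that share the same column names.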
