I have a large data frame with responses to 40 questions (reprex with 3 questions below) and need to compute a new column that is a complex function of these 40 responses. As it is virtually impossible to write out the function within mutate, I tried creating a function f that could be used within mutate:
library(dplyr)  # provides mutate() and %>%

df <- data.frame(Sex = c(rep("F", 5), rep("M", 5)),
                 Q1 = sample(0:10, 10, replace = TRUE),
                 Q2 = sample(0:10, 10, replace = TRUE),
                 Q3 = sample(0:10, 10, replace = TRUE)
)

f <- function(q1, q2, q3) {
  y <- q1 + (q2^2) - (q3^3)
  return(y)
}
Now creating a new column using mutate works fine:
df %>%
  mutate(newcol = f(Q1, Q2, Q3))
   Sex Q1 Q2 Q3 newcol
1    F 10  6  3     19
2    F  0  9  9   -648
3    F  8  1  2      1
4    F  0  4  7   -327
5    F  6  4  1     21
6    M  8  3  3    -10
7    M  2  2  0      6
8    M 10  0  3    -17
9    M  6  9  3     60
10   M  1  7  2     42
as does
df$newcol <- mapply(f, df$Q1, df$Q2, df$Q3)
But if I include even a simple if statement in f, as follows:
f <- function(q1, q2, q3) {
  y <- q1 + (q2^2) - (q3^3)
  if (y < 0) {
    y <- -y
  }
  return(y)
}
I immediately have a disaster on my hands:
df %>%
  mutate(newcol = f(Q1, Q2, Q3))
   Sex Q1 Q2 Q3 newcol
1    F 10  6  3     19
2    F  0  9  9   -648
3    F  8  1  2      1
4    F  0  4  7   -327
5    F  6  4  1     21
6    M  8  3  3    -10
7    M  2  2  0      6
8    M 10  0  3    -17
9    M  6  9  3     60
10   M  1  7  2     42
Warning message:
Problem with `mutate()` input `newcol`.
i the condition has length > 1 and only the first element will be used
i Input `newcol` is `f(Q1, Q2, Q3)`.
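The warning appears because if() tests a single TRUE/FALSE value, while mutate hands f the whole columns Q1, Q2 and Q3 at once, so only the first row's condition (19 < 0, which is FALSE) is checked and nothing is negated. In R 4.2.0 and later the same call stops with an error instead of a warning. A minimal sketch of a vectorised variant, assuming the sign flip is the only branch needed; f_vec and df_demo are hypothetical names used for illustration, with values taken from the first three rows above:

```r
# Small fixed data frame (first three rows of the output above)
df_demo <- data.frame(Q1 = c(10, 0, 8), Q2 = c(6, 9, 1), Q3 = c(3, 9, 2))

# f_vec is a hypothetical vectorised variant of f:
# ifelse() evaluates the condition element by element
f_vec <- function(q1, q2, q3) {
  y <- q1 + (q2^2) - (q3^3)
  ifelse(y < 0, -y, y)
}

# Base R call; inside mutate the equivalent would be newcol = f_vec(Q1, Q2, Q3)
df_demo$newcol <- with(df_demo, f_vec(Q1, Q2, Q3))
```

Because f_vec works on whole vectors, it does not need mapply at all.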
However,
df$newcol <- mapply(f, df$Q1, df$Q2, df$Q3)
df
   Sex Q1 Q2 Q3 newcol
1    F 10  6  3     19
2    F  0  9  9    648
3    F  8  1  2      1
4    F  0  4  7    327
5    F  6  4  1     21
6    M  8  3  3     10
7    M  2  2  0      6
8    M 10  0  3     17
9    M  6  9  3     60
10   M  1  7  2     42
continues to work. Unfortunately, there are lots of ifs in my function, and with 40 different arguments to pass to the function, the input to mapply becomes enormous. How can I pass my questions to mapply using a predefined vector, say something like:
questions <- c("df$Q1", "df$Q2", "df$Q3")
df$newcol <- mapply(f, questions)
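A character vector of expressions such as "df$Q1" is passed to mapply as plain strings, not as columns. One possible approach, sketched here under the assumption that the column names are listed in the same order as f's arguments, is to select the columns by name and hand them to mapply through do.call; question_cols and df_demo are hypothetical names:

```r
df_demo <- data.frame(Q1 = c(10, 0, 8), Q2 = c(6, 9, 1), Q3 = c(3, 9, 2))

f <- function(q1, q2, q3) {
  y <- q1 + (q2^2) - (q3^3)
  if (y < 0) {
    y <- -y
  }
  return(y)
}

# Column names only, without the "df$" prefix
question_cols <- c("Q1", "Q2", "Q3")

# as.list() turns the selected columns into a list of vectors;
# unname() makes them positional, so they are matched to q1, q2, q3 by order
args <- unname(as.list(df_demo[question_cols]))

# Equivalent to mapply(f, df_demo$Q1, df_demo$Q2, df_demo$Q3)
df_demo$newcol <- do.call(mapply, c(list(FUN = f), args))
```

With 40 questions only question_cols would grow; the do.call line stays the same.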
Closely related: How do I define a function with 40 arguments without it running off the page?
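One way a function might avoid a 40-argument signature is to accept a single named vector holding one row's responses and look the values up by name. A minimal sketch under that assumption; f_row, question_cols and df_demo are hypothetical names, not part of the code above:

```r
df_demo <- data.frame(Q1 = c(10, 0, 8), Q2 = c(6, 9, 1), Q3 = c(3, 9, 2))
question_cols <- c("Q1", "Q2", "Q3")

# f_row takes one row of responses as a named numeric vector
f_row <- function(row) {
  y <- row[["Q1"]] + (row[["Q2"]]^2) - (row[["Q3"]]^3)
  if (y < 0) {
    y <- -y
  }
  y
}

# apply() with MARGIN = 1 converts the selected columns to a matrix
# and calls f_row once per row
df_demo$newcol <- apply(df_demo[question_cols], 1, f_row)
```

Since apply converts the data frame to a matrix first, this sketch only suits columns that all share one type, such as numeric responses.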
It's entirely possible that I am barking up the wrong tree, and if so, how ought I to go about solving my problem?
Many thanks in advance
Thomas Philips
P.S. Here is the real criterion
if(!is.na(df[i, "Q1_Daily_Mean"]) & df[i, "Q1_Daily_Mean"] >= THRESHOLD_MDD_GAD){
  anxiety <- TRUE
}
if(!is.na(df[i, "Q2_Daily_Mean"]) & df[i, "Q2_Daily_Mean"] >= THRESHOLD_MDD_GAD){
  worry <- TRUE
}
if(anxiety && worry){
  anxiety_and_worry <- TRUE
}
if(!is.na(df[i, "Q3_Daily_Mean"]) & df[i, "Q3_Daily_Mean"] >= THRESHOLD_MDD_GAD ){
  agitation <- TRUE
}
if(!is.na(df[i, "Q10_Daily_Mean"]) & df[i, "Q10_Daily_Mean"] >= THRESHOLD_MDD_GAD ){
  anger <- TRUE
}
if(!is.na(df[i, "Q2_Weekly"]) & df[i, "Q2_Weekly"] >= THRESHOLD_MDD_GAD ){
  physical_fatigue <- TRUE
}
if(!is.na(df[i, "Q5_Weekly"]) & df[i, "Q5_Weekly"] >= THRESHOLD_MDD_GAD ){
  no_concentration <- TRUE
}
if(!is.na(df[i, "Q7_Weekly"]) & df[i, "Q7_Weekly"] >= THRESHOLD_MDD_GAD ){
  disturbed_sleep <- TRUE
}
if(!is.na(df[i, "Q13_Weekly"]) & !is.na(df[i, "Q14_Weekly"]) &
   !is.na(df[i, "Q15_Weekly"]) & !is.na(df[i, "Q16_Weekly"]) &
   !is.na(df[i, "Q17_Weekly"]) &
   max( df[i, "Q13_Weekly"], df[i, "Q14_Weekly"],
        df[i, "Q15_Weekly"], df[i, "Q16_Weekly"],
        df[i, "Q17_Weekly"] ) >= THRESHOLD_MDD_GAD){
  max_function <- TRUE
}
sum_of_symptoms_7 <- anxiety + worry + agitation + anger +
  physical_fatigue + no_concentration + disturbed_sleep
if (anxiety_and_worry && (sum_of_symptoms_7 >= CRITERIA_NEEDED_GAD) && max_function){
  # Generalized Anxiety Disorder
  df[i, GAD_DESCRIPTPR_EML] <- TRUE
}
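The criterion above is applied one row at a time through the index i, but each test is a comparison against a threshold, so it could plausibly be computed for all rows at once. A rough sketch under stated assumptions: the threshold values, the tiny data set, the helper meets and the output column name GAD are all hypothetical and chosen only for illustration; the question column names are the ones used above.

```r
# Hypothetical threshold values, for illustration only
THRESHOLD_MDD_GAD   <- 2
CRITERIA_NEEDED_GAD <- 3

# Tiny made-up data set using the column names from the criterion above
df_demo <- data.frame(
  Q1_Daily_Mean  = c(3, 3, NA),
  Q2_Daily_Mean  = c(3, 1, 3),
  Q3_Daily_Mean  = c(3, 3, 3),
  Q10_Daily_Mean = c(0, 3, 3),
  Q2_Weekly      = c(0, 3, 3),
  Q5_Weekly      = c(0, 3, 3),
  Q7_Weekly      = c(0, 3, 3),
  Q13_Weekly     = c(3, 3, 3),
  Q14_Weekly     = c(0, 0, 0),
  Q15_Weekly     = c(0, 0, 0),
  Q16_Weekly     = c(0, 0, 0),
  Q17_Weekly     = c(0, 0, 0)
)

# TRUE where a response is present and at or above the threshold
meets <- function(x) !is.na(x) & x >= THRESHOLD_MDD_GAD

symptom_cols <- c("Q1_Daily_Mean", "Q2_Daily_Mean", "Q3_Daily_Mean",
                  "Q10_Daily_Mean", "Q2_Weekly", "Q5_Weekly", "Q7_Weekly")
max_cols <- paste0("Q", 13:17, "_Weekly")

# Logical matrix: one row per respondent, one column per symptom
symptoms <- sapply(df_demo[symptom_cols], meets)
sum_of_symptoms_7 <- rowSums(symptoms)

anxiety_and_worry <- symptoms[, "Q1_Daily_Mean"] & symptoms[, "Q2_Daily_Mean"]

# Row-wise maximum over Q13..Q17; pmax() yields NA if any of them is NA,
# which meets() then treats as FALSE, matching the is.na checks above
max_function <- meets(do.call(pmax, unname(as.list(df_demo[max_cols]))))

df_demo$GAD <- anxiety_and_worry &
  sum_of_symptoms_7 >= CRITERIA_NEEDED_GAD &
  max_function
```

In this made-up data only the first respondent satisfies all three conditions: the second fails the worry threshold and the third has a missing Q1_Daily_Mean.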