I have a data frame in wide format containing demographic details and various questionnaire scores.
The data set looks something like the following:
id <- c(1, 2, 3, 4, 5)
gender <- c("Male", "Female", "Male", "Female", "Male")
group <- c("A", "A", "B", "C", "B")
subscale_1 <- c(NA, NA, 3, 2, 3)
subscale_2 <- c(3, 3, NA, 2, NA)
subscale_3 <- c(3, 2, 5, NA, 1)
subscale_4 <- c(1, NA, 3, NA, 5)
subscale_5 <- c(NA, 5, NA, 8, NA)
df <- data.frame(id, gender, group, subscale_1, subscale_2,
subscale_3, subscale_4, subscale_5)
I want to loop the creation of multiple boxplots, with one of the demographic columns on the x-axis and one of the questionnaire score columns on the y-axis. For example, one of my boxplots could have group on the x-axis and subscale_1 on the y-axis.
However, not every participant has to complete every questionnaire, so different groups of participants have NAs in particular score columns. For example, those who belong to group A have NAs in the subscale_1 column, while participants in group B are not required to complete subscale_2, and so on.
While looping the creation of boxplots, I want R to remove such unused factor levels from the x-axis. Hence, a boxplot where (x = group, y = subscale_1) should contain only groups B and C on the x-axis. Similarly, a boxplot where (x = group, y = subscale_2) should contain only groups A and C on the x-axis.
I have managed to use the following code to loop the creation of multiple plots:
library(ggplot2)
lapply(names(df)[which(names(df) == "subscale_1"):
                 which(names(df) == "subscale_5")], function(x) {
  ggplot(df, aes_string(x = "group", y = x)) + geom_boxplot()
})
I have followed some other Stack Overflow threads where others advise adding drop = TRUE to scale_x_discrete and scale_fill_discrete. I have tried this, but the unused x-axis levels are still shown in the resulting boxplots. A very helpful thread here suggests using df[!is.na(df$questionnaire_column), ], but as my y-axis column changes on every iteration, I am not sure how to specify the column being iterated over inside my lapply code.
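To make the question concrete, here is a minimal sketch of how I imagine the per-column filter could sit inside the loop. It is an assumption on my part rather than a confirmed solution: score_cols, keep and plots are names I made up for illustration, the toy data is rebuilt so the snippet runs on its own, and I use the .data pronoun in aes() in place of aes_string().

```r
library(ggplot2)

# Same toy data as above, rebuilt so the sketch runs on its own.
df <- data.frame(
  id = 1:5,
  gender = c("Male", "Female", "Male", "Female", "Male"),
  group = c("A", "A", "B", "C", "B"),
  subscale_1 = c(NA, NA, 3, 2, 3),
  subscale_2 = c(3, 3, NA, 2, NA),
  subscale_3 = c(3, 2, 5, NA, 1),
  subscale_4 = c(1, NA, 3, NA, 5),
  subscale_5 = c(NA, 5, NA, 8, NA)
)

# Hypothetical helper vector: every column whose name starts with "subscale_".
score_cols <- grep("^subscale_", names(df), value = TRUE)

# One plot per score column. Rows with NA in the current column are dropped
# before plotting, so groups without any scores never reach the x-axis.
plots <- lapply(score_cols, function(col) {
  keep <- df[!is.na(df[[col]]), ]
  ggplot(keep, aes(x = .data[["group"]], y = .data[[col]])) +
    geom_boxplot() +
    labs(x = "group", y = col)
})
names(plots) <- score_cols
```

With this, plots$subscale_1 would be built only from the rows that have a subscale_1 score, and the other plots likewise for their own columns.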
In the future, I will be doing something similar to the above, but possibly with bar plots instead of boxplots, because I am thinking of looking at means/medians by a particular demographic variable (e.g. group, gender, etc.) across all questionnaire score columns. Will simply substituting geom_boxplot with geom_bar still work, or would the code need a major overhaul?
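For the bar plot case, this is roughly what I have in mind, again only as a sketch under assumptions. As far as I understand, geom_bar() counts rows by default and does not accept a y aesthetic, so here the group means are computed first with aggregate() and drawn with geom_col(). The names score_cols, means, mean_score and mean_plots are hypothetical, and only the columns needed for the example are rebuilt.

```r
library(ggplot2)

# Reduced copy of the toy data: the grouping column and two score columns.
df <- data.frame(
  group = c("A", "A", "B", "C", "B"),
  subscale_1 = c(NA, NA, 3, 2, 3),
  subscale_2 = c(3, 3, NA, 2, NA)
)

score_cols <- grep("^subscale_", names(df), value = TRUE)

# For each score column: drop NA rows, compute the mean per group,
# then draw one bar per remaining group.
mean_plots <- lapply(score_cols, function(col) {
  keep <- df[!is.na(df[[col]]), ]
  means <- aggregate(keep[[col]], by = list(group = keep$group), FUN = mean)
  names(means) <- c("group", "mean_score")
  ggplot(means, aes(x = group, y = mean_score)) +
    geom_col() +
    labs(x = "group", y = paste("mean", col))
})
names(mean_plots) <- score_cols
```

Swapping FUN = mean for FUN = median should give the median version, and replacing "group" with another demographic column such as gender should work the same way.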
Thanks!