
I have a data frame, in wide format, containing demographic details and various questionnaire scores.

The data set would look something like the following:

id <- c(1, 2, 3, 4, 5)
gender <- c("Male", "Female", "Male", "Female", "Male")
group <- c("A", "A", "B", "C", "B")
subscale_1 <- c(NA, NA, 3, 2, 3)
subscale_2 <- c(3, 3, NA, 2, NA)
subscale_3 <- c(3, 2, 5, NA, 1)
subscale_4 <- c(1, NA, 3, NA, 5)
subscale_5 <- c(NA, 5, NA, 8, NA)
df <- data.frame(id, gender, group, subscale_1, subscale_2,
                 subscale_3, subscale_4, subscale_5)

I want to loop over the questionnaire score columns and create one boxplot per column, with one of the demographic columns on the x-axis and the score column on the y-axis. For example, one of my boxplots could have group on the x-axis and subscale_1 on the y-axis.

However, not every participant has to respond to every questionnaire, so different groups of participants have NAs in particular questionnaire score columns. For example, participants in group A have NAs in the subscale_1 column, participants in group B are not required to complete subscale_2, and so on.

While looping over the columns, I want R to remove such unused factor levels from the x-axis. A boxplot where (x = group, y = subscale_1) should therefore contain only groups B and C on the x-axis. Similarly, a boxplot where (x = group, y = subscale_2) should contain only groups A and C on the x-axis.

I have managed to use the following code to loop the creation of multiple plots:

library(ggplot2)

# One boxplot per questionnaire column, from subscale_1 through subscale_5
lapply(names(df)[which(names(df) == "subscale_1"):
                     which(names(df) == "subscale_5")], function(x) 
                {ggplot(df, aes_string(x = "group", y = x)) + geom_boxplot()})

I have followed some other Stack Overflow threads where others advise adding drop = TRUE to scale_x_discrete and scale_fill_discrete. I have tried it, but the unused x-axis levels are not removed from the created boxplots. A very helpful thread here suggests using df[!is.na(df$questionnaire_column), ], but as my y-axis column changes on every iteration, I am not sure how to build that filter into my lapply code.
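As a minimal sketch of how that row filter could be moved inside the loop: the column name `x` is used to subset the data before it reaches ggplot, so groups with no observed score never appear on the axis. The names `score_cols`, `plot_data` and `plots` are illustrative and not from the original code, and the `.data` pronoun assumes a reasonably recent ggplot2 (the `aes_string` call from the code above should work the same way on older versions).

```r
library(ggplot2)

# Same toy data as above (only the columns needed here)
df <- data.frame(
  group      = c("A", "A", "B", "C", "B"),
  subscale_1 = c(NA, NA, 3, 2, 3),
  subscale_2 = c(3, 3, NA, 2, NA)
)

# Illustrative name: the score columns to loop over
score_cols <- c("subscale_1", "subscale_2")

plots <- lapply(score_cols, function(x) {
  # Keep only rows that have a score in column x, so groups without
  # any observed value are absent from the data ggplot sees
  plot_data <- df[!is.na(df[[x]]), ]
  ggplot(plot_data, aes(x = group, y = .data[[x]])) +
    geom_boxplot() +
    labs(x = "group", y = x)
})
names(plots) <- score_cols
```

Each element of `plots` is a ggplot object, so `plots$subscale_1` draws the boxplot for that column.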

I will be doing something similar in the future, possibly with bar plots instead of boxplots. Will simply substituting geom_bar for geom_boxplot still work, or would the code need a major overhaul? I am thinking of looking at means/medians by a particular demographic variable (e.g. group, gender, etc.) across all questionnaire score columns.
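On the bar plot idea, a direct swap is unlikely to work, because `geom_bar()` counts rows by default and does not accept a y aesthetic. One possible approach, sketched here under the assumption that a mean per group is what is wanted, is to summarise each column first and draw the result with `geom_col()`. The helper `plot_group_summary` and its arguments are hypothetical names introduced for illustration.

```r
library(ggplot2)

df <- data.frame(
  group      = c("A", "A", "B", "C", "B"),
  subscale_1 = c(NA, NA, 3, 2, 3),
  subscale_3 = c(3, 2, 5, NA, 1)
)

# Hypothetical helper: one bar per group showing a summary of column x
plot_group_summary <- function(data, x, by = "group", fun = mean) {
  kept <- data[!is.na(data[[x]]), ]   # drop rows with no score in x
  # Summarise per level; as.character() avoids empty factor levels
  stat <- tapply(kept[[x]], as.character(kept[[by]]), fun)
  summary_df <- data.frame(level = names(stat), value = as.vector(stat))
  ggplot(summary_df, aes(x = level, y = value)) +
    geom_col() +
    labs(x = by, y = x)
}

bar_plots <- lapply(c("subscale_1", "subscale_3"),
                    plot_group_summary, data = df)
```

Passing `fun = median` or `by = "gender"` (if that column is present in the data) would give the other summaries mentioned above without changing the loop.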

Thanks!

DTYK
  • The 'demographic_variable' is not in the example – akrun Mar 21 '18 at 07:27
  • @akrun The demographic variable varies (e.g. "group", "gender", etc). It will be specified by me when I actually run the script, and when I ran the script on any demographic variables, it worked. I'm unsure how to proceed with the dropping of a particular unused x-axis factor for each demographic variable. – DTYK Mar 21 '18 at 07:30
  • It is hard for us to play around with your code when we cannot run it, specifically in this case, where you are having problems with the x-axis but don't supply the data for the x-axis. – Axeman Mar 21 '18 at 07:59
  • Thanks for the edit. Use `ggplot(na.omit(df[c("group", x)]), aes_string(x = "group", y = x)) + geom_boxplot()`, as suggested in [this thread](https://stackoverflow.com/questions/11403104/remove-unused-factor-levels-from-a-ggplot-bar-plot) you linked. Closing now as duplicate, but hope you have been helped! – Axeman Mar 21 '18 at 08:03
  • Ah, I actually had an answer more relevant to this, showcasing the tidyr::nest and purrr::map functions: a way to make a separate plot for every column and for different variables on the x-axis. If this question can be reopened I can post my answer. The difference with this question is that he wants to do it for multiple plots, for which I think the answer can be improved. The linked duplicate question only shows the case for 1 plot. – Lodewic Van Twillert Mar 21 '18 at 08:14
  • Why can't you use facets here? – Jack Brookes Mar 21 '18 at 08:18
  • @Axeman Thanks, it worked. I don't think my question is a duplicate of that question though. It comprises two questions: (a) looping the creation of multiple plots and (b) removing unused x-axis factor levels. Still, I stand by your decision and am grateful for your help. :) – DTYK Mar 21 '18 at 08:32
  • @LodewicVanTwillert Looking forward to your response even though I already have the answer to my query. Hope the mods reopen this thread. Cheers! – DTYK Mar 21 '18 at 08:34
  • @JackBrookes I have not considered that before. I have over 250 variables/boxplots. Would facets be able to incorporate that much information in them without sacrificing user comprehension? – DTYK Mar 21 '18 at 08:39
  • Hi all. Other answers related to multiple plots (and not just removing unwanted x-variables) can be posted to a relevant question, perhaps such as [this one](https://stackoverflow.com/questions/39242727/how-to-plot-multiple-categorical-variables-in-r). As it stands, the question of multiple plots was already solved by OP, with the `lapply` solution. – Axeman Mar 21 '18 at 08:59
  • If necessary, one could post a new question specifically about how to best approach plotting many variables (after searching for a good duplicate). I hope this is helpful! – Axeman Mar 21 '18 at 09:00
  • If it helps, here was my proposed solution, although I don't think this is going to be very readable. It assumes "gender" is a factor (it can be "group" too) and uses the pipe operator from dplyr (in this case loaded through the tidyverse package): `df.nest <- df %>% gather(Subscale, Value, subscale_1:subscale_5) %>% group_by(Subscale) %>% nest() %>% mutate(PlotGender = map(data, function(data) { data %>% filter(!is.na(Value)) %>% mutate(gender = droplevels(gender)) %>% ggplot(., aes_string(x = "gender", y = "Value")) + geom_boxplot() }))` – Lodewic Van Twillert Mar 21 '18 at 09:02
  • @LodewicVanTwillert I tried your script. It ended up as a data frame. How do I extract the plots from it? Need some help with that. Thanks again! – DTYK Mar 21 '18 at 12:04
  • The plots are the values in `df.nest`. For example, you should be able to get the first plot with `df.nest$PlotGender[[1]]`. The values in that column are ggplot objects :) – Lodewic Van Twillert Mar 21 '18 at 15:09
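
The nested approach from the comments above can be sketched in runnable form as follows. This is only an illustration under a few assumptions: `pivot_longer()` is swapped in for `gather()` (it needs a recent tidyr), the NA filter runs before nesting rather than inside `map()`, and `gender` is left as a character column, so no `droplevels()` call is needed. The names `first_plot` and `plots_by_subscale` are illustrative.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(ggplot2)

df <- data.frame(
  gender     = c("Male", "Female", "Male", "Female", "Male"),
  subscale_1 = c(NA, NA, 3, 2, 3),
  subscale_5 = c(NA, 5, NA, 8, NA)
)

df.nest <- df %>%
  # Long format: one row per participant and subscale
  pivot_longer(starts_with("subscale_"),
               names_to = "Subscale", values_to = "Value") %>%
  filter(!is.na(Value)) %>%          # drop unanswered questionnaires
  group_by(Subscale) %>%
  nest() %>%                         # one row per subscale, data in a list column
  ungroup() %>%
  mutate(PlotGender = map(data, function(d) {
    ggplot(d, aes(x = gender, y = Value)) + geom_boxplot()
  }))

# Double brackets return the ggplot object itself; single brackets
# return a list of length one that contains it
first_plot <- df.nest$PlotGender[[1]]

# Naming the list by subscale allows lookup by column name
plots_by_subscale <- setNames(df.nest$PlotGender, df.nest$Subscale)
```

Typing `first_plot` or `plots_by_subscale[["subscale_5"]]` at the console draws the corresponding plot.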
