Retain all columns after using group_by, summarise, and mutate in dplyr on categorical variables, and plot a barplot with confidence intervals

Question

I'm new to R.

This is my dataset

df <- tribble( ~Area_of_interst, ~Meds, ~Response,
               "Internal Med", "asprin", "yes",
               "Internal Med", "vitamins", "no",
               "Internal Med", "folic acid", "yes",
               "Emergency Med", "asprin", "yes",
               "Emergency Med", "vitamins", "no",
               "Emergency Med", "folic acid", "yes")

I have about 6 different "Area_of_interest" values. As you can see, all my variables are categorical. I want to plot a barplot of all 6 "Area_of_interest" values by Meds, keeping only the rows with Response "yes", on the same barplot. Each bar should have its respective confidence interval.

I have two questions:

1. After I used the summarise function, I no longer had access to the variable "Area_of_interest". All the variables are categorical. How do I compute the proportions without using summarise, or how do I fix my code below so that I retain all my columns?
2. How do I compute the confidence interval for the barplot for each "Area_of_interest"?

df %>% na.omit() %>% 
  group_by(Meds, Response) %>% summarise( ct=n()) %>%
  mutate(propn =paste0( round(100*ct/sum(ct),1),"%" )) %>% 
  filter(Response=="yes") %>% ggplot(aes(x=Meds, y=propn)) + 
  geom_col(position = "dodge")

Hi! If you change your group_by to `group_by(Area_of_interst, Meds, Response)`, is the result what you expect regarding your groups? — Juan Bosco, May 13 '22 at 19:07
You need to calculate the proportions in your call to `summarise`, since it needs all data to be present. Perhaps you can use `summarise(proportion = sum(Response == "yes") / n(), count = n())` or similar. — r2evans, May 13 '22 at 19:09
@JuanBosco, if I use `group_by(Area_of_interst, Meds, Response)`, the proportion will be computed over all three variables, but I want the proportion to be computed by just the "Response" variable. @r2evans, I got proportions of ones and zeros, and that's not what I'm looking for. — Denise, May 14 '22 at 02:27
You can try replacing the `summarize` call with a `mutate` call. This retains all columns, unlike `summarize`, which keeps only the grouping columns because it collapses each group into a single row. — Prashant Bharadwaj, May 14 '22 at 09:40
@PrashantBharadwaj I tried that, but I still didn't have access to the other columns in my dataframe. — Denise, May 16 '22 at 13:00

Prashant Bharadwaj · Answer 1 · 2022-05-17T16:15:51.970

To answer your two questions:

1. summarise computes summary metrics within each group: it collapses the rows of each group into a single row, so any variable that is not a grouping variable is dropped. For your goal, keep all three of your variables as grouping variables when you call summarise.
2. To calculate the proportions, you then need to drop the Response variable from the grouping for the next operation, where the percentages of yes and no are calculated. Unless I misunderstood your percentage metric.
3. I wouldn't recommend rounding and pasting a percent sign onto the variable, since that turns it into a character column. You should be able to do that while formatting the plot using the scales package, as recommended in this post or this stackoverflow.
4. Regarding confidence intervals: if I understood your question right, confidence intervals are not relevant in your plot, since you are plotting a single data point and not a summary of central tendency (mean, median, etc.) from a distribution. If you had to, you can use geom_errorbar to show confidence intervals.
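
The first two points above can be checked on a small table before plotting. This is only a sketch: the rows below are illustrative toy data, not the asker's real dataset, and `group_vars()` is used just to inspect which grouping is left after `summarise`.

```r
library(dplyr)
library(tibble)

# Toy data in the same shape as the question (values are illustrative)
df <- tribble(
  ~Area_of_interest, ~Meds,      ~Response,
  "Internal Med",    "asprin",   "yes",
  "Internal Med",    "asprin",   "yes",
  "Internal Med",    "asprin",   "no",
  "Emergency Med",   "asprin",   "yes",
  "Emergency Med",   "vitamins", "no"
)

# One row per Area_of_interest x Meds x Response combination
counts <- df %>%
  group_by(Area_of_interest, Meds, Response) %>%
  summarise(ct = n(), .groups = "drop_last")

# "drop_last" removes only Response from the grouping
group_vars(counts)

# sum(ct) is therefore taken within each Area_of_interest x Meds pair
props <- counts %>%
  mutate(proportion = ct / sum(ct)) %>%
  ungroup()

props
```

Keeping `proportion` numeric (rather than pasting a "%" onto it) leaves it usable as a y aesthetic; the percent formatting can be applied on the axis instead.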

For the plot, I would recommend splitting into multiple facets for clear visibility since you have 6 categories. So use either colour or facets from my answer and remove the other one.

Here's the code -

library(tidyverse)

# load data 
df <- tribble( ~Area_of_interest ,~Meds,~Response, 
               "Internal Med", "asprin", "yes",
               "Internal Med", "asprin", "yes",
               "Internal Med", "asprin", "no",
               
               "Internal Med", "vitamins","no",
               "Internal Med", "folic acid","yes",
               "Emergency Med", "asprin", "yes",
               "Emergency Med", "vitamins","no",
               "Emergency Med", "folic acid","yes")
               

df %>% 
  na.omit() %>% 
  group_by(Area_of_interest, Meds, Response) %>% 
  
  summarise(ct=n(), .groups = 'drop_last') %>% # removes Response from the grouping variable for the next operation
  
  # proportion = % of 'yes'/'no' within every Area for each Med 
  mutate(proportion = 100 * ct/sum(ct)) %>% # Note: mutate conducts operation within each group, which decides the sum(ct)
  
  filter(Response=="yes") %>% 
  
  ggplot(aes(x=Meds, y=proportion, fill = Area_of_interest)) + 
  geom_col(position = "dodge") +
  
  # optionally, separate areas of interest into sub-panels for better visual clarity and remove the colouring with 'fill'
  facet_grid(rows = 'Area_of_interest')

Created on 2022-05-16 by the reprex package (v2.0.1)
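
If an error estimate on the proportions is required, one possible sketch with `geom_errorbar` follows. Several things here are assumptions rather than part of the answer above: the normal-approximation (Wald) interval with a 1.96 multiplier, the helper column names `total`, `yes`, `se`, `lower` and `upper`, and counting "yes" per Area_of_interest and Meds directly instead of filtering. With counts as small as in this toy data the Wald interval is crude; `binom.test()` or `prop.test()` from base R give alternative intervals.

```r
library(dplyr)
library(tibble)
library(ggplot2)

# Same toy data as in the answer
df <- tribble(
  ~Area_of_interest, ~Meds,        ~Response,
  "Internal Med",    "asprin",     "yes",
  "Internal Med",    "asprin",     "yes",
  "Internal Med",    "asprin",     "no",
  "Internal Med",    "vitamins",   "no",
  "Internal Med",    "folic acid", "yes",
  "Emergency Med",   "asprin",     "yes",
  "Emergency Med",   "vitamins",   "no",
  "Emergency Med",   "folic acid", "yes"
)

ci_df <- df %>%
  na.omit() %>%
  group_by(Area_of_interest, Meds) %>%
  summarise(
    total = n(),                    # responses per Area x Med
    yes   = sum(Response == "yes"), # number of "yes" among them
    .groups = "drop"
  ) %>%
  mutate(
    proportion = yes / total,
    # Assumption: Wald standard error and 95% interval, clipped to [0, 1]
    se    = sqrt(proportion * (1 - proportion) / total),
    lower = pmax(0, proportion - 1.96 * se),
    upper = pmin(1, proportion + 1.96 * se)
  )

p <- ggplot(ci_df, aes(x = Meds, y = proportion, fill = Area_of_interest)) +
  geom_col(position = position_dodge(width = 0.9)) +
  geom_errorbar(aes(ymin = lower, ymax = upper),
                position = position_dodge(width = 0.9), width = 0.2) +
  # format the numeric proportion as a percentage on the axis
  scale_y_continuous(labels = scales::label_percent())

p
```

Unlike the pipeline above, this keeps Area and Med combinations with zero "yes" responses as bars of height 0; filter on `yes > 0` if they should be dropped.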

Thank you @Prashant. It worked. One last question: how do I add the confidence interval? — Denise, May 17 '22 at 04:12
As mentioned in my point 4, since you are plotting a single value of data and not the mean of many data points, a confidence interval is not relevant here, unless I misunderstood what the bar was meant to plot in the first place. — Prashant Bharadwaj, May 17 '22 at 07:08
Gotcha @Prashant. I was asked to add a CI or another error estimate to the proportions. — Denise, May 17 '22 at 12:13
If you had to, you can use [`geom_errorbar`](https://ggplot2.tidyverse.org/reference/geom_linerange.html) to show confidence intervals — Prashant Bharadwaj, May 17 '22 at 16:16
Hi @Denise. If your question has been satisfactorily answered, could you click on the checkmark on the left of the answer to make it green please? It is helpful for future readers encountering this question, to know that it has been solved and also gives points to the person who answers :) — Prashant Bharadwaj, Jun 21 '22 at 05:26
