
I've looked everywhere for an answer to this question but I am still stuck, so here it is:

I have a data frame data_1 that contains data from an ongoing latent profile analysis. The variables of interest for this question are profiles and gender.

I would like to plot the gender distribution by profile, but within each profile show what % of each gender falls in that profile, relative to the entire sample of that gender. For example, if we have 10 women in total and 5 of them are in Profile 1, I want the text on top of the women bar for Profile 1 to show 50%.

Right now I am using the following code, but it gives me the percentage relative to the entire sample, while I want the percentage relative to the total number of people of that gender (e.g. the total number of women).

ggplot(data = subset(data_1, !is.na(gender)),
       aes(x = gender, fill = gender)) + geom_bar() +
  facet_grid(cols=vars(profiles)) + theme_minimal() +
  scale_fill_brewer(palette = 'Accent', name = "Gender", 
                    labels = c("Non-binary", "Man", "Woman")) +
  labs(x = "Gender", title = "Gender distribution per LPA profile") +
  geom_text(aes(y = ((..count..)/sum(..count..)), 
                label = scales::percent((..count..)/sum(..count..))), 
            stat = "count", vjust = -28)

Thanks in advance for your help!

I tried multiple alternatives including creating the variable within the dataset using summarize and mutate but with no success unfortunately.

ebianchi
  • This will be easier to answer if you [make the question reproducible](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) by including a small representative dataset in a plain text format, for example the output from `dput(data_1)`, if that is not too large. – neilfws Jan 18 '23 at 22:54
  • It would be easier to help you if you provide [a minimal reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) including a snippet of your data or some fake data. But in general I would go for computing the counts and percentages outside of `ggplot()` as the easiest approach. – stefan Jan 18 '23 at 22:54

1 Answer


As untidy as it seems, it's likely the best approach to summarise outside of the ggplot2 call, which can be done like this:

library(tidyverse)

data1 <- tibble(gender = sample(c("male", "female"), 100, replace = TRUE),
                profile = sample(c("profile1", "profile2"), 100, replace = TRUE))

data1 |> 
  count(gender, profile) |>
  group_by(gender) |> 
  mutate(perc = n / sum(n)) |> 
  ggplot(aes(x = gender, y = n, fill = gender)) +
  geom_col() +
  facet_grid(~profile) +
  geom_text(aes(y = n + 3, label = scales::percent(perc)))

facet_grid essentially splits the dataset by profile before any values are calculated, so each facet is blind to the data in the other facet. I think the only approach is thus to summarise before the call and use geom_col (which defaults to stat = "identity") to draw the bars. Note that the y value for the labels is calculated from the count variable n, so the text is positioned relative to the tops of the bars.
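To check the summarising step against the example in the question (10 women, 5 of them in Profile 1, so 50%), here is a small sketch on made-up data. The toy data frame and its values are assumptions for illustration only, not the asker's data:

```r
library(dplyr)

# Hypothetical toy data: 10 women (5 in profile1) and 4 men (1 in profile1)
toy <- data.frame(
  gender  = c(rep("female", 10), rep("male", 4)),
  profile = c(rep("profile1", 5), rep("profile2", 5),
              "profile1", rep("profile2", 3))
)

toy_summary <- toy |>
  count(gender, profile) |>      # one row per gender/profile combination
  group_by(gender) |>            # so sum(n) is the total for that gender only
  mutate(perc = n / sum(n)) |>   # share of each gender that falls in each profile
  ungroup()

toy_summary
```

Because the percentages are computed within each gender, they add up to 100% per gender across the facets, not per facet.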

Edit - actually no, there's a "simpler" way

I tell a lie, you can actually do it in the ggplot2 call, but it's a little messier:

data1 |>
  ggplot(aes(x = gender, fill = gender)) +
  geom_bar() +
  facet_grid(~ profile) +
  stat_count(aes(y = after_stat(count) + 2,
                 label = scales::percent(
                   after_stat(count) /
                     tapply(after_stat(count), after_stat(group), sum)[after_stat(group)]
                 )),
             geom = "text")

Code borrowed from here. The after_stat(group) part is accessing the grouped gender count across both facets. Today I learned something!
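For anyone puzzled by the tapply(...)[...] part, here is the same idiom in plain base R. The count and group vectors below are made-up stand-ins for what after_stat(count) and after_stat(group) would hold, assuming (as the result above suggests) that the computed counts from both facets are available together at that point:

```r
# Hypothetical bar heights: facet 1 (female, male), then facet 2 (female, male)
count <- c(5, 1, 5, 3)
# Group ids as ggplot2 would assign them: 1 = female, 2 = male
group <- c(1, 2, 1, 2)

# Total per group across all bars: group 1 -> 10, group 2 -> 4
group_totals <- tapply(count, group, sum)

# Indexing by `group` repeats each total once per bar, so the division lines up
share <- count / group_totals[group]

scales::percent(as.numeric(share))
```

The positional indexing works here because the group ids are consecutive integers starting at 1; with other ids, indexing by as.character(group) would be the safer choice.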

Andy Baxter