
I have three cohorts of students identified by an ExperimentCohort factor. For each student, I have a LetterGrade, also a factor. I'd like to plot a histogram-like bar graph of LetterGrade for each ExperimentCohort. Using

ggplot(df, alpha = 0.2,
       aes(x = LetterGrade, group = ExperimentCohort, fill = ExperimentCohort)) +
  geom_bar(position = "dodge")

gets me very close, but the three ExperimentCohorts don't have the same number of students. To compare these on a more even field, I'd like the y-axis to be the in-cohort proportion of each letter-grade. So far, short of calculating this proportion and putting it in a separate dataframe before plotting, I have not been able to find a way to do this.

Every solution to a similar question on SO and elsewhere involves aes(y = ..count../sum(..count..)), but sum(..count..) is executed across the whole dataframe rather than within each cohort. Anyone got a suggestion? Here's code to create an example dataframe:

df <- data.frame(ID = 1:60, 
        LetterGrade = sample(c("A", "B", "C", "D", "E", "F"), 60, replace = T),
        ExperimentCohort = sample(c("One", "Two", "Three"), 60, replace = T))

Thanks.

Claire Sannier

3 Answers


Wrong solution

You can use stat_bin() and y=..density.. to get percentages in each group.

ggplot(df, alpha = 0.2,
      aes(x = LetterGrade, group = ExperimentCohort, fill = ExperimentCohort))+
      stat_bin(aes(y=..density..), position='dodge')

UPDATE - correct solution

As pointed out by @rpierce, y=..density.. calculates density values for each group, not percentages (the two are not the same).
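
To make the difference concrete, here is a minimal base R sketch. The toy vector x and the bin width of 0.25 are assumptions chosen for illustration. A density is the count divided by (n times the bin width), so it equals the in-group proportion only when the bin width is exactly 1, and it can exceed 1 for narrower bins.

```r
# Toy data (hypothetical): six observations taking the values 1, 2 and 3.
x <- c(1, 1, 2, 3, 3, 3)

# Bin with a width of 0.25 rather than 1 (an assumption for illustration).
bin_width <- 0.25
h <- hist(x, breaks = seq(0.5, 3.5, by = bin_width), plot = FALSE)

proportion <- h$counts / sum(h$counts)  # share of observations in each bin
density    <- h$density                 # counts / (n * bin width)

sum(proportion)           # proportions sum to 1
sum(density * bin_width)  # densities integrate to 1 over the x-axis
max(density)              # larger than 1 here, so clearly not a proportion

# Each density is the proportion divided by the bin width.
all.equal(density, proportion / bin_width)
```

With bars of width 1 the two quantities coincide, which is why the density-based plot can look right by accident.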

One way to get the correct percentages is to calculate them before plotting. For this I used the function ddply() from the plyr library. Within each ExperimentCohort, the proportions are calculated with prop.table() and table() and saved as prop. names() and table() then recover the matching LetterGrade labels.

library(plyr)  # provides ddply(), .() and summarise()

df.new<-ddply(df,.(ExperimentCohort),summarise,
              prop=prop.table(table(LetterGrade)),
              LetterGrade=names(table(LetterGrade)))

 head(df.new)
  ExperimentCohort       prop LetterGrade
1              One 0.21739130           A
2              One 0.08695652           B
3              One 0.13043478           C
4              One 0.13043478           D
5              One 0.30434783           E
6              One 0.13043478           F
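
The same in-cohort proportions can also be computed without plyr. The following is a base R sketch, not the code used above: the name df.base and the fixed seed are assumptions introduced for illustration, and the example data frame from the question is rebuilt so the block runs on its own.

```r
set.seed(1)  # assumption: fixed seed so the sketch is reproducible
df <- data.frame(
  ID = 1:60,
  LetterGrade = sample(c("A", "B", "C", "D", "E", "F"), 60, replace = TRUE),
  ExperimentCohort = sample(c("One", "Two", "Three"), 60, replace = TRUE)
)

# Rows are cohorts, columns are grades; margin = 1 normalises each row,
# so every cohort's proportions sum to 1.
tab <- prop.table(
  table(ExperimentCohort = df$ExperimentCohort, LetterGrade = df$LetterGrade),
  margin = 1
)

# Long format: one row per cohort/grade pair, proportion in column `prop`.
df.base <- as.data.frame(tab, responseName = "prop")

rowSums(tab)  # check: each cohort sums to 1
head(df.base)
```

A data frame shaped like df.base has the ExperimentCohort, LetterGrade and prop columns that a plot with stat="identity" expects.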

Now use this new data frame for plotting. Since the proportions are already calculated, supply them as the y values and add stat="identity" inside geom_bar().

ggplot(df.new,aes(LetterGrade,prop,fill=ExperimentCohort))+
  geom_bar(stat="identity",position='dodge')
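
For comparison, recent versions of ggplot2 can compute a per-group proportion inside the plot itself, which avoids the intermediate data frame. This is a hedged sketch assuming ggplot2 3.3.0 or later (for after_stat()); it relies on the prop variable computed by stat_count, which is the count divided by the group's total, so the group aesthetic must be set to ExperimentCohort. The fixed seed is an assumption for reproducibility.

```r
library(ggplot2)

set.seed(1)  # assumption: fixed seed so the sketch is reproducible
df <- data.frame(
  ID = 1:60,
  LetterGrade = sample(c("A", "B", "C", "D", "E", "F"), 60, replace = TRUE),
  ExperimentCohort = sample(c("One", "Two", "Three"), 60, replace = TRUE)
)

# `prop` is computed within each group, so grouping by cohort gives the
# in-cohort proportion of each letter grade.
p <- ggplot(df, aes(x = LetterGrade, fill = ExperimentCohort)) +
  geom_bar(aes(y = after_stat(prop), group = ExperimentCohort),
           position = "dodge")

# Inspect the values the stat computed: prop sums to 1 within each group.
built <- layer_data(p)
tapply(built$prop, built$group, sum)
```

Printing p draws the plot. Checking the computed data with layer_data() is a cheap way to confirm that the bars are proportions within each cohort rather than across the whole data frame.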


Didzis Elferts
  • Nailed it. Thanks a lot... don't know how I failed to find this answer elsewhere. Do you know what it is about `..count..` that behaves this way while `..density..` does not? Or maybe it's endemic to the difference between `geom_bar` and `stat_bin`? – Claire Sannier Jun 28 '13 at 15:52
  • stat_bin function is applied to each group separately – Didzis Elferts Jun 28 '13 at 15:54
  • Except this answer isn't quite right: http://stats.stackexchange.com/questions/4220/a-probability-distribution-value-exceeding-1-is-ok. See: http://stackoverflow.com/questions/17655648/how-can-i-plot-the-relative-proportions-of-two-groups-using-a-fill-aesthetic-in for the correct solution. – russellpierce Jul 22 '13 at 00:26
  • @rpierce Corrected my answer. – Didzis Elferts Jul 22 '13 at 05:46
  • (+1) I tried this recently, and got most ways home, but needed to wrap this `prop=prop.table(table(LetterGrade))` in a call to `as.numeric`, so, `prop=as.numeric(prop.table(table(LetterGrade)))`. – tchakravarty Nov 02 '14 at 17:08
  • This is not a histogram, it's a _barplot_ – Julien Jan 28 '23 at 16:44

You can also do this by creating a weight column that sums to 1 for each group:

library(dplyr)  # provides %>%, group_by(), mutate() and n()

ggplot(df %>%
         group_by(ExperimentCohort) %>%
         mutate(weight = 1 / n()),
       aes(x = LetterGrade, fill = ExperimentCohort)) +
  geom_histogram(aes(weight = weight), stat = 'count', position = 'dodge')
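
To see why this works, note that each student contributes 1 / (cohort size), so the weights in a cohort/grade cell add up to that grade's share of the cohort. A minimal base R sketch of the arithmetic follows; the names by_weight and by_table and the fixed seed are assumptions for illustration.

```r
set.seed(1)  # assumption: fixed seed so the sketch is reproducible
df <- data.frame(
  ID = 1:60,
  LetterGrade = sample(c("A", "B", "C", "D", "E", "F"), 60, replace = TRUE),
  ExperimentCohort = sample(c("One", "Two", "Three"), 60, replace = TRUE)
)

# Weight of each student: 1 / (number of students in that student's cohort).
df$weight <- 1 / ave(rep(1, nrow(df)), df$ExperimentCohort, FUN = sum)

# Summing the weights per cohort/grade cell ...
by_weight <- xtabs(weight ~ ExperimentCohort + LetterGrade, data = df)

# ... gives the same numbers as the row-normalised contingency table.
by_table <- prop.table(table(df$ExperimentCohort, df$LetterGrade), margin = 1)

all.equal(as.vector(by_weight), as.vector(by_table))
```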
sirallen

I recently attempted this and received an error when calling ddply: Column prop must be length 1 (a summary value), not 6. I spent some time with ddply but couldn't quite get the solution to work, so I offer an alternative (note that this uses dplyr rather than plyr):

library(dplyr)

# Count students per cohort/grade, then divide by the cohort total.
df.new <- df %>%
    group_by(ExperimentCohort, LetterGrade) %>%
    summarise(n = n()) %>%
    mutate(freq = n / sum(n))
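
The reason sum(n) is taken within each cohort is that summarise() drops only the last grouping variable, so the result is still grouped by ExperimentCohort. Here is a minimal sketch that makes this explicit, assuming dplyr 1.0.0 or later (for the .groups argument); the intermediate name counts and the fixed seed are assumptions for illustration.

```r
library(dplyr)

set.seed(1)  # assumption: fixed seed so the sketch is reproducible
df <- data.frame(
  ID = 1:60,
  LetterGrade = sample(c("A", "B", "C", "D", "E", "F"), 60, replace = TRUE),
  ExperimentCohort = sample(c("One", "Two", "Three"), 60, replace = TRUE)
)

counts <- df %>%
  group_by(ExperimentCohort, LetterGrade) %>%
  # "drop_last" spells out the default: only LetterGrade is dropped.
  summarise(n = n(), .groups = "drop_last")

group_vars(counts)  # still grouped by ExperimentCohort

df.new <- counts %>%
  mutate(freq = n / sum(n)) %>%  # sum(n) is the cohort total, not the grand total
  ungroup()

# Check: the frequencies sum to 1 within each cohort.
df.new %>%
  group_by(ExperimentCohort) %>%
  summarise(total = sum(freq))
```

Calling ungroup() before mutate() instead would divide by the grand total, which is the whole-data-frame behaviour the question is trying to avoid.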

Then you can plot it just as @didzis-elferts mentioned:

ggplot(df.new,aes(LetterGrade,freq,fill=ExperimentCohort))+
    geom_bar(stat="identity",position='dodge')
mploenzke
  • could you explain the mutate step? why is sum(n) only within experimentcohort. I would have expected sum(n) to be for whole data frame - seems magic! – seanv507 Feb 09 '19 at 13:42
  • If we had wanted `sum(n)` to be calculated over the whole data frame we would need to call `ungroup()` within the pipe before the mutate call. Note you can check the grouping by using `group_vars(df.new)`, and that the `summarise` call will 'ungroup' the last grouping variable. – mploenzke Feb 10 '19 at 14:25
  • For example, add a column to the data frame: `df <- data.frame(ID = 1:60,LetterGrade = sample(c("A", "B", "C", "D", "E", "F"), 60, replace = T),ExperimentCohort = sample(c("One", "Two", "Three"), 60, replace = T), test = sample(c("A", "B", "C", "D", "E", "F"), 60, replace = T))` Then compare the output from: `df %>% group_by(ExperimentCohort,LetterGrade,test) %>% summarise (n = n()) %>% group_vars()` with: `df %>% group_by(ExperimentCohort,LetterGrade,test) %>% group_vars()` – mploenzke Feb 10 '19 at 14:28
  • This is why the `mutate(freq = n / sum(n))` calculates the sum over the ExperimentCohort group and no longer over the LetterGrade group as well. Hope that helps clear it up! – mploenzke Feb 10 '19 at 14:30
  • The question is for histograms, not barplots – Julien Jan 28 '23 at 16:50