44

My question is very similar to Normalizing y-axis in histograms in R ggplot to proportion, except that I have two groups of data of different size, and I would like that each proportion is relative to its group size instead of the total size.

To make it clearer, let's say I have two sets of data in a data frame:

dataA<-rnorm(100,3,sd=2)
dataB<-rnorm(400,5,sd=3)
all<-data.frame(dataset=c(rep('A',length(dataA)),rep('B',length(dataB))),value=c(dataA,dataB))

I can plot the two distributions together with:

ggplot(all,aes(x=value,fill=dataset))+geom_histogram(alpha=0.5,position='identity',binwidth=0.5)

and instead of the frequency on the Y axis I can have the proportion with:

ggplot(all,aes(x=value,fill=dataset))+geom_histogram(aes(y=..count../sum(..count..)),alpha=0.5,position='identity',binwidth=0.5)

But this gives the proportion relative to the total data size (500 points here): is it possible to have it relative to each group size?

My goal here is to make it possible to compare visually the proportion of values in a given bin between A and B, independently from their respective size. Ideas which differ from my original one are also welcome!

Thanks!

Community
  • 1
  • 1
Erwan
  • 1,385
  • 1
  • 12
  • 22

1 Answers1

59

Like this? [edited based on OP's comment]

ggplot(all,aes(x=value,fill=dataset))+
  geom_histogram(aes(y=0.5*..density..),
                 alpha=0.5,position='identity',binwidth=0.5)

Using y=..density.. scales the histograms so the area under each is 1, or sum(binwidth*y)=1. As a result, you would use y = binwidth*..density.. to have y represent the fraction of the total in each bin. In your case, binwidth=0.5.

IMO this is a little easier to interpret:

ggplot(all,aes(x=value,fill=dataset))+
  geom_histogram(aes(y=0.5*..density..),binwidth=0.5)+
  facet_wrap(~dataset,nrow=2)
jlhoward
  • 58,004
  • 7
  • 97
  • 140
  • 2
    Thank you for your answer, but actually I don't think the density is equivalent to dividing by the counts by dataset, unless the bar width is 1 (see for example in http://stackoverflow.com/questions/11766856/normalizing-y-axis-in-histograms-in-r-ggplot-to-proportion). In particular in my case I would like to read the percentage that a bin represents on the Y axis. – Erwan Mar 04 '14 at 21:07
  • You're right - need to scale by binwidth. See my edits. – jlhoward Mar 04 '14 at 21:37
  • 4
    oh yes, of course! I didn't even realize that when I replied, I guess I was too focused on the version with `y=..count../sum(..count..)`. Apparently with `..count../sum(..count..)` each count is divided by the total count, whereas in the case of `..density..` it is applied to each group independently: I still don't understand why, in case you have an idea? – Erwan Mar 04 '14 at 23:41
  • 3
    I just ran into this issue. According to https://github.com/tidyverse/ggplot2/issues/3522, you may be able to do `aes(y = stat(density*width))`, which will achieve the same result but without needing to manually set the binwidth. It's still a little unclear whether this is *supposed* to be allowed however. – Marius Dec 11 '19 at 00:38