I would like to plot a histogram in which the y axis shows the proportion of observations in each bin. I tried the code suggested here

https://ggplot2.tidyverse.org/reference/geom_histogram.html

ggplot(data=diamonds, aes(x=carat, after_stat(density))) +  geom_histogram(binwidth = 0.05, position="identity", fill =  "white", colour = "black") 

and here

Normalizing y-axis in histograms in R ggplot to proportion by group

ggplot(data=diamonds, aes(x=carat)) +  geom_histogram(aes(y=..density..), binwidth = 0.05, position="identity", fill =  "white", colour = "black") 

but the y axis range is higher than 1 in both cases.

Also, when I decrease the binwidth, the range of the y axis (i.e. the proportion in the most represented group) becomes higher, which does not make sense: the group sizes should decrease as I increase the number of groups.

Pavel Shliaha

2 Answers

This is because the histogram is an estimator of the density, rather than a display of the proportion in each bin. With density on the y axis, each bar's height is the bin's proportion divided by the bin width, so it is the bar areas, not the bar heights, that sum to 1. That is also why the bars get taller as you decrease the binwidth. A continuous density function integrates to 1, but it can indeed have a height greater than 1. Plot the density function of a normal distribution with decreasing variance to convince yourself of this. If you want the histogram to reflect the proportion in each bin, you will have to create a new categorical variable recording which bin each observation falls in and then summarize it as the proportion falling within each bin. My question, however, would be why you would want to do this, or rather, why this is a better summary than the density already given (it is merely a scaled version of the density and still gives relative proportions)?
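The manual binning described above could look roughly like the following. This is only a sketch: it assumes bins that start at 0, whereas ggplot2 places its bin edges differently by default, so individual bars need not match the plot exactly. The names `binwidth`, `breaks`, `bins`, `proportion` and `density` are made up for illustration.

```r
library(ggplot2)  # used here only for the diamonds data set

binwidth <- 0.05
x <- diamonds$carat

# Bin edges covering the whole range (assumption: the first bin starts at 0).
breaks <- seq(0, max(x) + binwidth, by = binwidth)

# Assign every observation to a bin [a, b), then count per bin.
bins   <- cut(x, breaks = breaks, right = FALSE)
counts <- table(bins)

# Proportion per bin: the heights sum to 1, so none can exceed 1.
proportion <- as.numeric(counts) / length(x)

# Density per bin: proportion divided by bin width, so the areas sum to 1
# and the heights may exceed 1 when the bins are narrow.
density <- proportion / binwidth

sum(proportion)          # 1, up to floating-point error
sum(density * binwidth)  # 1, up to floating-point error
max(proportion)
max(density)
```

Halving `binwidth` roughly halves the proportion in each bin but divides it by a width that is also halved, which is why the density scale does not shrink as the bins get narrower.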

Edit:

If you feel this is better interpreted as the proportion falling in each bin, the following S.O. post has your answer:

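library(ggplot2)
data(diamonds)
# Bar height = bin count / total count, i.e. the proportion of observations per bin.
# after_stat() (ggplot2 >= 3.3.0) replaces the deprecated ..count.. notation.
ggplot(diamonds, aes(x = carat)) +
  geom_histogram(aes(y = after_stat(count / sum(count))), binwidth = 0.05)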
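One way to check that the bars really are proportions is to pull the computed bar heights out of the plot and confirm they sum to 1. The sketch below assumes ggplot2 3.3.0 or later (for `after_stat()`, the newer spelling of `..count..`); the object names `p` and `bars` are illustrative.

```r
library(ggplot2)

p <- ggplot(diamonds, aes(x = carat)) +
  geom_histogram(aes(y = after_stat(count / sum(count))), binwidth = 0.05)

# layer_data() returns the data computed by the layer's stat, one row per bin.
bars <- layer_data(p)

sum(bars$y)       # should be 1, up to floating-point error
max(bars$y) <= 1  # no single bar can exceed a proportion of 1
```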
  • Because I want the actual proportions, since it is a much more intuitive way of data representation – Pavel Shliaha May 18 '20 at 17:44
  • I prefer it to be interpreted as a density function myself, but I'll let you decide what's best for your audience. The above edit should have your solution. – Jordan Schupbach May 18 '20 at 18:09
  • Thanks a lot! I am not a full-time statistician, so I am not sure why the density function is better than the straightforward proportion of observations per bin. Could you give a short explanation or perhaps a link to a discussion? – Pavel Shliaha May 19 '20 at 00:24
  • Happy to. :) My preference is merely a philosophical one. In one case, you are treating the variable as a continuous random variable whereas in the other you are treating it as a discrete random variable. Both are valid distribution functions, but the histogram is an estimator (and visualization) of a continuous random variable. If I truly had a discrete random variable, I would choose a barplot (perhaps stacked) as a visualization of a discrete random variable. The spacing between bars would clearly indicate the different categories the variable could take. – Jordan Schupbach May 19 '20 at 13:30

I think this is what you're looking for:

# after_stat() (ggplot2 >= 3.3.0) replaces the deprecated stat() helper.
ggplot(data = diamonds, aes(x = carat)) +
  geom_histogram(aes(y = after_stat(count / sum(count))),
                 binwidth = 0.1, position = "identity",
                 fill = "white", colour = "black")
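The proportion and the density differ only by the bin width, so multiplying the computed `density` by the computed `width` should give the same bar heights as `count / sum(count)`. A small sketch comparing the two, assuming ggplot2 3.3.0 or later; `base`, `by_count` and `by_density` are illustrative names.

```r
library(ggplot2)

base <- ggplot(diamonds, aes(x = carat))

# Proportion computed directly from the bin counts.
by_count <- base +
  geom_histogram(aes(y = after_stat(count / sum(count))), binwidth = 0.1)

# Proportion recovered from the density: height times bin width is the bar area.
by_density <- base +
  geom_histogram(aes(y = after_stat(density * width)), binwidth = 0.1)

# The two sets of bar heights should agree up to floating-point error.
all.equal(layer_data(by_count)$y, layer_data(by_density)$y)
```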

Matt
  • No, because some of your observations are bigger than 1, which is impossible for a proportion. All you did is scale the axis in the code I said wasn't working – Pavel Shliaha May 18 '20 at 17:43
  • What are you trying to accomplish, then? If you *increase* the binwidth (to 0.15 for example), you will have densities that are below 1. – Matt May 18 '20 at 17:59
  • I am trying to get the y axis to represent the proportion of observations in the bin (as the title suggests). E.g. if there are 100 observations and bin 20 has 20, then the height of the bar should be 0.2 – Pavel Shliaha May 18 '20 at 18:54
  • I edited above to show the proportion of observations per bin. – Matt May 18 '20 at 19:21