4

I am on the lookout for the most elegant way to superimpose normal distribution fits in grouped histograms in ggplot2. I know this question has been asked many times before, but none of the proposed options, like this one or this one struck me as very elegant, at least not unless stat_function could be made to work on each particular subsection of the data.

One relatively elegant way to superimpose a normal distribution fit onto a non-grouped histogram that I did come across was using geom_smooth and method="nls" (aside from the fact then that it's not a self-starting function and that starting values have to be specified) :

library(ggplot2)
myhist = data.frame(size = 10:27, counts = c(1L, 3L, 5L, 6L, 9L, 14L, 13L, 23L, 31L, 40L, 42L, 22L, 14L, 7L, 4L, 2L, 2L, 1L) )
ggplot(data=myhist, aes(x=size, y=counts)) + geom_point() + 
     geom_smooth(method="nls", formula = y ~ N * dnorm(x, m, s), se=F, 
                 start=list(m=20, s=5, N=300)) 

enter image description here

I was wondering though whether this approach could also be used to add normal distribution fits to grouped histograms as in

library(devtools)
install_github("tomwenseleers/easyGgplot2",type="source")
library("easyGgplot2") # load weight data
ggplot(weight,aes(x = weight)) + 
+     geom_histogram(aes(y = ..count.., colour=sex, fill=sex),alpha=0.5,position="identity")

enter image description here

I was also wondering if there are any packages that might define a + stat_distrfit() or + stat_normfit() for ggplot2 by any chance (with the possibility for grouping) ? (I couldn't really find anything, but this would seem like a common enough task, so I was just wondering)

Reason I want the code to be as short as possible is that this is for a course, and that I want to keep things as easy as possible...

PS geom_density does not suit my goal and I would also like to plot the counts/frequencies as opposed to the density. I would also like to have them in the same panel, and avoid using facet_wrap

Community
  • 1
  • 1
Tom Wenseleers
  • 7,535
  • 7
  • 63
  • 103

1 Answers1

3

Like this?

## simulate your dataset - could not get easyGplot2 to load....
set.seed(1)     # for reproducible example
weight <- data.frame(sex=c("Female","Male"), weight=rnorm(1000,mean=c(65,67),sd=1))

library(ggplot2)
library(MASS)       # for fitdistr(...)
get.params <- function(z) with(fitdistr(z,"normal"),estimate[1:2])
df <- aggregate(weight~sex, weight, get.params)
df <- data.frame(sex=df[,1],df[,2])
x  <- with(weight, seq(min(weight),max(weight),len=100))
gg <- data.frame(weight=rep(x,nrow(df)),df)
gg$y <- with(gg,dnorm(x,mean,sd))
gg$y <- gg$y * aggregate(weight~sex, weight,length)$weight * diff(range(weight$weight))/30

ggplot(weight,aes(x = weight, colour=sex)) + 
  geom_histogram(aes(y = ..count.., fill=sex), alpha=0.5,position="identity") +
  geom_line(data=gg, aes(y=y))  

I suppose "elegant" is in the eye of the beholder. The problem with using stat_function(...) is that the args=... list cannot be mapped using aes(...), as the post in the comments explains. So you have to create an auxiliary data.frame (gg in this example) that has the x- and y-values for the fitted distributions, and use geom_line(...).

The code above uses fitdistr(...) in the MASS package to calculate maximum likelihood estimates of the mean and sd of your data, grouped by gender, based on the normality assumption (you can use a different distribution if that makes sense). It then creates an x-axis by dividing the range in weight into 100 increments, and calculates dnorm(x,...) for the appropriate mean and sd. Since the result is density, we have to adjust that using:

gg$y <- gg$y * aggregate(weight~sex, weight,length)$weight * diff(range(weight$weight))/30

because you want to map this against count data. Note that this assumes you use the default binning in geom_histogram (which divides the range in x into 30 equal increments). Finally, we add the call to geom_line(...) using gg as the layer-specific dataset.

jlhoward
  • 58,004
  • 7
  • 97
  • 140
  • Many thanks for this - yes this was what I was looking for! Still a bit surprising that stat_function() cannot be mapped - I really don't see any intrinsic reason why that shouldn't be possible to implement sooner or later... I'll try to wrap this in a ggplot2.normhist() function in my easyGgplot2 fork to save my students a bit of code... :-) – Tom Wenseleers Sep 06 '15 at 19:38