4

I'm looking to plot the following histograms:

library(palmerpenguins)
library(tidyverse)

penguins %>% 
  ggplot(aes(x=bill_length_mm, fill = species)) +
  geom_histogram() + 
  facet_wrap(~species)

enter image description here

For each histogram, I would like to add a Normal Distribution to each histogram with each species mean and standard deviation.

Of course I'm aware that I could compute the group specific mean and SD before embarking on the ggplot command, but I wonder whether there is a smarter/faster way to do this.

I have tried:

penguins %>% 
  ggplot(aes(x=bill_length_mm, fill = species)) +
  geom_histogram() + 
  facet_wrap(~species) + 
  stat_function(fun = dnorm)

But this only gives me a thin line at the bottom:

enter image description here

Any ideas? Thanks!

Edit I guess what I'm trying to recreate is this simple command from Stata:

hist bill_length_mm, by(species) normal

which gives me this: enter image description here

I understand that there are some suggestions here: using stat_function and facet_wrap together in ggplot2 in R

But I'm specifically looking for a short answer that does not require me creating a separate function.

Moritz Schwarz
  • 2,019
  • 2
  • 15
  • 33
  • You'll need to calculate this. Try manual calculation of `dnorm(penguins$bill_length_mm)` - you'll notice *ridiculously* small numbers (to the power of -300 or so!). I guess you need to bin those first to make sense of that dnorm call. Rounding does not help, so I don't think it's (merely) a floating point issue – tjebo Jan 27 '21 at 18:02
  • 1
    Thanks - I'll try it. I've added the figure that motivated me to try this from Stata. There of course the data is converted to a density – Moritz Schwarz Jan 27 '21 at 18:23

2 Answers2

6

A while I ago I sort of automated this drawing of theoretical densities with a function that I put in the ggh4x package I wrote, which you might find convenient. You would just have to make sure that the histogram and theoretical density are at the same scale (for example counts per x-axis unit).

library(palmerpenguins)
library(tidyverse)
library(ggh4x)

penguins %>% 
  ggplot(aes(x=bill_length_mm, fill = species)) +
  geom_histogram(binwidth = 1) + 
  stat_theodensity(aes(y = after_stat(count))) +
  facet_wrap(~species)
#> Warning: Removed 2 rows containing non-finite values (stat_bin).

You can vary the bin size of the histogram, but you'd have to adjust the theoretical density count too. Typically you'd multiply by the binwidth.

penguins %>% 
  ggplot(aes(x=bill_length_mm, fill = species)) +
  geom_histogram(binwidth = 2) + 
  stat_theodensity(aes(y = after_stat(count)*2)) +
  facet_wrap(~species)
#> Warning: Removed 2 rows containing non-finite values (stat_bin).

Created on 2021-01-27 by the reprex package (v0.3.0)

If this is too much of a hassle, you can always convert the histogram to density instead of the density to counts.

penguins %>% 
  ggplot(aes(x=bill_length_mm, fill = species)) +
  geom_histogram(aes(y = after_stat(density))) + 
  stat_theodensity() +
  facet_wrap(~species)
teunbrand
  • 33,645
  • 4
  • 37
  • 63
3

While the ggh4x package is the way to go in this case, a more generalizable approach is with tapply and the use of the PANEL variable which is added to the data when a facet is applied.

penguins %>% 
  ggplot(aes(x=bill_length_mm, fill = species)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30) + 
  facet_wrap(~species) + 
  geom_line(aes(y = dnorm(bill_length_mm,
                          mean = tapply(bill_length_mm, species, mean, na.rm = TRUE)[PANEL],
                          sd = tapply(bill_length_mm, species, sd, na.rm = TRUE)[PANEL])))

enter image description here

Ian Campbell
  • 23,484
  • 14
  • 36
  • 57
  • Many thanks for that very useful addition. I wasn't aware of the `PANEL` variable, which I'll need to read up on! I'm afraid the question was closed due to some relation to this question: https://stackoverflow.com/questions/1376967/using-stat-function-and-facet-wrap-together-in-ggplot2-in-r Perhaps share your solution there as well! :) – Moritz Schwarz Jan 28 '21 at 10:58
  • Why not use `species` instead of `PANEL`? `PANEL` is just `species` without the labels. – jtr13 Sep 17 '21 at 22:47
  • This is very helpful and I learned about PANEL and a new way to get data on the panels. Be aware that this only puts a point where there is an observation in the original data. With scarcer data, the curve will not be smooth. If you replace geom_line with geom_smooth, it will not know the distribution and will have tails that go negative. This worked, but for my application I also chose ggh4x. – Steven Ouellette Sep 26 '22 at 19:11