ggplot add Normal Distribution while using `facet_wrap`

Question

I'm looking to plot the following histograms:

library(palmerpenguins)
library(tidyverse)

penguins %>% 
  ggplot(aes(x=bill_length_mm, fill = species)) +
  geom_histogram() + 
  facet_wrap(~species)

For each histogram, I would like to add a Normal Distribution to each histogram with each species mean and standard deviation.

Of course I'm aware that I could compute the group specific mean and SD before embarking on the ggplot command, but I wonder whether there is a smarter/faster way to do this.

I have tried:

penguins %>% 
  ggplot(aes(x=bill_length_mm, fill = species)) +
  geom_histogram() + 
  facet_wrap(~species) + 
  stat_function(fun = dnorm)

But this only gives me a thin line at the bottom:

Any ideas? Thanks!

Edit I guess what I'm trying to recreate is this simple command from Stata:

hist bill_length_mm, by(species) normal

which gives me this:

I understand that there are some suggestions here: using stat_function and facet_wrap together in ggplot2 in R

But I'm specifically looking for a short answer that does not require me creating a separate function.

You'll need to calculate this. Try manual calculation of `dnorm(penguins$bill_length_mm)` - you'll notice *ridiculously* small numbers (to the power of -300 or so!). I guess you need to bin those first to make sense of that dnorm call. Rounding does not help, so I don't think it's (merely) a floating point issue — tjebo, Jan 27 '21 at 18:02
Thanks - I'll try it. I've added the figure that motivated me to try this from Stata. There of course the data is converted to a density — Moritz Schwarz, Jan 27 '21 at 18:23

teunbrand · Accepted Answer · 2021-02-11T16:37:07.343

6

A while I ago I sort of automated this drawing of theoretical densities with a function that I put in the ggh4x package I wrote, which you might find convenient. You would just have to make sure that the histogram and theoretical density are at the same scale (for example counts per x-axis unit).

library(palmerpenguins)
library(tidyverse)
library(ggh4x)

penguins %>% 
  ggplot(aes(x=bill_length_mm, fill = species)) +
  geom_histogram(binwidth = 1) + 
  stat_theodensity(aes(y = after_stat(count))) +
  facet_wrap(~species)
#> Warning: Removed 2 rows containing non-finite values (stat_bin).

You can vary the bin size of the histogram, but you'd have to adjust the theoretical density count too. Typically you'd multiply by the binwidth.

penguins %>% 
  ggplot(aes(x=bill_length_mm, fill = species)) +
  geom_histogram(binwidth = 2) + 
  stat_theodensity(aes(y = after_stat(count)*2)) +
  facet_wrap(~species)
#> Warning: Removed 2 rows containing non-finite values (stat_bin).

^{Created on 2021-01-27 by the reprex package (v0.3.0)}

If this is too much of a hassle, you can always convert the histogram to density instead of the density to counts.

penguins %>% 
  ggplot(aes(x=bill_length_mm, fill = species)) +
  geom_histogram(aes(y = after_stat(density))) + 
  stat_theodensity() +
  facet_wrap(~species)

edited Feb 11 '21 at 16:37

answered Jan 27 '21 at 19:49

teunbrand

33,645
4
37
63

I kind of expected that your ggh4x had a stat for that too :) – tjebo Jan 27 '21 at 20:09
2

The main motivator for that entire package is 'things I get frustrated with that should be easier' :) – teunbrand Jan 27 '21 at 20:20
Indeed, this is fantastic! Looking forward to seeing this on CRAN! – Moritz Schwarz Jan 27 '21 at 22:29
1

You're too kind. I probably should at some point! – teunbrand Jan 27 '21 at 22:30
Definitely agree, very useful package – Ian Campbell Jan 27 '21 at 22:50
I'm afraid the question was closed due to some relation to this question: https://stackoverflow.com/questions/1376967/using-stat-function-and-facet-wrap-together-in-ggplot2-in-r Perhaps share your solution there as well! :) – Moritz Schwarz Jan 28 '21 at 10:57
Good point, I've posted as an answer here: https://stackoverflow.com/a/65936807/11374827 – teunbrand Jan 28 '21 at 12:06

score 3 · Answer 2 · answered Jan 27 '21 at 23:16

3

While the ggh4x package is the way to go in this case, a more generalizable approach is with tapply and the use of the PANEL variable which is added to the data when a facet is applied.

penguins %>% 
  ggplot(aes(x=bill_length_mm, fill = species)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30) + 
  facet_wrap(~species) + 
  geom_line(aes(y = dnorm(bill_length_mm,
                          mean = tapply(bill_length_mm, species, mean, na.rm = TRUE)[PANEL],
                          sd = tapply(bill_length_mm, species, sd, na.rm = TRUE)[PANEL])))

answered Jan 27 '21 at 23:16

Ian Campbell

23,484
14
36
57

Many thanks for that very useful addition. I wasn't aware of the `PANEL` variable, which I'll need to read up on! I'm afraid the question was closed due to some relation to this question: https://stackoverflow.com/questions/1376967/using-stat-function-and-facet-wrap-together-in-ggplot2-in-r Perhaps share your solution there as well! :) – Moritz Schwarz Jan 28 '21 at 10:58
Why not use `species` instead of `PANEL`? `PANEL` is just `species` without the labels. – jtr13 Sep 17 '21 at 22:47
This is very helpful and I learned about PANEL and a new way to get data on the panels. Be aware that this only puts a point where there is an observation in the original data. With scarcer data, the curve will not be smooth. If you replace geom_line with geom_smooth, it will not know the distribution and will have tails that go negative. This worked, but for my application I also chose ggh4x. – Steven Ouellette Sep 26 '22 at 19:11

ggplot add Normal Distribution while using `facet_wrap`

2 Answers2

Linked