9

I want to add a density line (a normal density actually) to a histogram.

Suppose I have the following data. I can plot the histogram by ggplot2:

set.seed(123)    
df <- data.frame(x = rbeta(10000, shape1 = 2, shape2 = 4))

ggplot(df, aes(x = x)) + geom_histogram(colour = "black", fill = "white", 
                                        binwidth = 0.01) 

enter image description here

I can add a density line using:

ggplot(df, aes(x = x)) + 
  geom_histogram(aes(y = ..density..),colour = "black", fill = "white", 
                 binwidth = 0.01) + 
  stat_function(fun = dnorm, args = list(mean = mean(df$x), sd = sd(df$x)))

enter image description here

But this is not what I actually want, I want this density line to be fitted to the count data.

I found a similar post (HERE) that offered a solution to this problem. But it did not work in my case. I need to an arbitrary expansion factor to get what I want. And this is not generalizable at all:

ef <- 100 # Expansion factor

ggplot(df, aes(x = x)) + 
  geom_histogram(colour = "black", fill = "white", binwidth = 0.01) + 
  stat_function(fun = function(x, mean, sd, n){ 
    n * dnorm(x = x, mean = mean, sd = sd)}, 
    args = list(mean = mean(df$x), sd = sd(df$x), n = ef))

enter image description here

Any clues that I can use to generalize this

  • first to normal distribution,
  • then to any other bin size,
  • and lastly to any other distribution will be very helpful.
HBat
  • 4,873
  • 4
  • 39
  • 56

1 Answers1

15

Fitting a distribution function does not happen by magic. You have to do it explicitly. One way is using fitdistr(...) in the MASS package.

library(MASS)    # for fitsidtr(...)
# excellent fit (of course...)
ggplot(df, aes(x = x)) + 
  geom_histogram(aes(y=..density..),colour = "black", fill = "white", binwidth = 0.01)+
  stat_function(fun=dbeta,args=fitdistr(df$x,"beta",start=list(shape1=1,shape2=1))$estimate)

# horrible fit - no surprise here
ggplot(df, aes(x = x)) + 
  geom_histogram(aes(y=..density..),colour = "black", fill = "white", binwidth = 0.01)+
  stat_function(fun=dnorm,args=fitdistr(df$x,"normal")$estimate)

# mediocre fit - also not surprising...
ggplot(df, aes(x = x)) + 
  geom_histogram(aes(y=..density..),colour = "black", fill = "white", binwidth = 0.01)+
  stat_function(fun=dgamma,args=fitdistr(df$x,"gamma")$estimate)

EDIT: Response to OP's comment.

The scale factor is binwidth ✕ sample size.

ggplot(df, aes(x = x)) + 
  geom_histogram(colour = "black", fill = "white", binwidth = 0.01)+
  stat_function(fun=function(x,shape1,shape2)0.01*nrow(df)*dbeta(x,shape1,shape2),
                args=fitdistr(df$x,"beta",start=list(shape1=1,shape2=1))$estimate)

jlhoward
  • 58,004
  • 7
  • 97
  • 140
  • 1
    Thanks for the generalization to different distributions. My ultimate aim is to fit these lines to the count data instead of density. Do you have any insights on how to do that? (I want to obtain same plot as the third plot of the original post.) – HBat Dec 26 '14 at 22:45
  • The `0.01` value in the formula (`0.01*nrow(df)*dbeta(x,shape1,shape2)`) is not generalizable to different binwidths or sample sizes. Suppose I have a sample size 2474 (instead of 10000) and 0.03 (instead of 0.01). I believe that 0.01 should be the function of bin width and possibly the sample size. – HBat Dec 27 '14 at 14:30
  • 1
    It absolutely *is* generalizable. The `0.01` is the binwidth. So if you use `binwidth=0.03`, then you would use `fun=function(x,shape1,shape2)0.03*nrow(df)*dbeta(x,shape1,shape2)` in the call to `stat_function(...)`. Of course the counts would all change. – jlhoward Dec 27 '14 at 17:06
  • My mistake, it is generalizable to different bin sizes and sample sizes. Thank you for the solution. – HBat Dec 27 '14 at 17:34