0

I'm wondering if it might be possible to draw a theoretical density histogram for a given continuous distribution in R?

By a theoretical histogram, I mean a histogram that is not based on the available random variable generators in R (e.g., hist(rnorm(1e4))). Rather, a histogram that exactly matches the probability density function (pdf) of the continuous distribution in question for a user-defined support (i.e., range for the random variable) with adjustable breaks.

As an example in R, we know that the pdf of a standard normal distribution for the support -5 to 5 is theoretically shaped like obtained by the R code below.

In R, can we turn this exact theoretical pdf to a corresponding theoritical density histogram? Any suggestion as to how this could be done in R?

c = curve(dnorm(x), -5, 5, n = 1e4)
rnorouzian
  • 7,397
  • 5
  • 27
  • 72
  • There is no way to "exactly match" a continuous distribution with a histogram. Any histogram is irredeemably discrete. – IRTFM Jun 22 '17 at 19:31
  • What's the difference between `theoretical pdf (probability density function)` and `theoretical density histogram`? – CPak Jun 22 '17 at 19:32
  • If you already have the pdf I would use `predict` with your breaks and then a barplot – Dan Slone Jun 22 '17 at 19:32
  • Define "how close" you want it. What level of error is acceptable? A normal distribution has a domain of [-Inf, +Inf] – IRTFM Jun 22 '17 at 19:33
  • Is this a duplicate of ... Fitting a density curve to a histogram in R: https://stackoverflow.com/questions/1497539/fitting-a-density-curve-to-a-histogram-in-r – IRTFM Jun 22 '17 at 19:35
  • If you explain what the task really is, maybe we can help you. For a histogram you need to specify breaks, and the programs will do a count. For a bargraph you need to specify midpoints and heights. So far you have not specified either. I'd say the request is "unclear" at the moment. – IRTFM Jun 22 '17 at 19:40

2 Answers2

1

You can do a pretty good job of this with the theoretical cumulative density. By taking the difference between the value of the cumulative distribution at either end of the region, you get how much of the probability should be assigned to each region. These should be weighted by the width of the region. The example below uses the normal distribution.

barplot(diff(2*pnorm(c(-Inf, seq(-4,4,0.5), Inf))), ylim=c(0,0.4))

Of course, if you only have the density, not the cumulative density, you may have to integrate.

Note that the x-axis on the barplot is the index number of the bin, not the actual x value. Because of this, if you simply plot the norm density fuction over this, it will not line up. However, it is easy to make a linear shift in coordinates that will cause the pdf to line up correctly. For this example, the center of the distribution is at 10.9 and each bin of width 0.5 takes up 1.2 in the new scale, so you transform the x's with (x-10.9)/2.4. To overlay the curve, try

barplot(diff(2*pnorm(c(-Inf, seq(-4,4,0.5), Inf))), ylim=c(0,0.4))
Shifted = function(x) dnorm((x-10.9)/2.4)
curve(Shifted, 0.5,21.5, add = T) 

Sorry, having trouble posting pictures from this machine.

G5W
  • 36,531
  • 10
  • 47
  • 80
  • This is faster and cleaner than the answer I was putting together with dnorm and a seq() – Dan Slone Jun 22 '17 at 20:01
  • @G5W - why not `barplot((dnorm(c(-Inf, seq(-4,4,0.5), Inf))), ylim=c(0,0.4))`? – Dan Slone Jun 22 '17 at 20:07
  • That is because the x values do not match up. The x values on my plot are the index number of the bin, not the actual x value. You can shift the pdf to make it fit, but I need a few moments to compute the translation factor. – G5W Jun 22 '17 at 20:07
  • OK thanks. I noticed the bars were shifted over by 1/2 between the two plots – Dan Slone Jun 22 '17 at 20:08
  • @DanSlone without the 2 = 1/(0.5) you would be getting the area of the region. But we want the average height of the pdf in the region so we must divide by the width of the bin. My bins have width 0.5, so you need to multiply by 1/0.5 = 2 – G5W Jun 22 '17 at 20:10
  • Adding pdf to answer. – G5W Jun 22 '17 at 20:21
1

For arbitrary functions, use the integrate function:

c = curve(dnorm, -5, 5, n = 1e4)
breaks <- pretty(c[[1]], nclass.Sturges(c[[1]]))

bar_x <- sapply(1:(length(breaks)-1), function(i) {
    (breaks[i]+breaks[i+1])/2
})

bar_y <- sapply(1:(length(breaks)-1), function(i) {
    integrate(dnorm, breaks[i], breaks[i+1])$value
})

barplot(bar_y, names.arg=bar_x)

enter image description here

thc
  • 9,527
  • 1
  • 24
  • 39
  • Please RUN the following: `barplot(bar_y, names.arg=bar_x, ylim = c(0, max(c$y)) )`. This works well but I'm sure `max(c$y)` is close to `.4` while the maximum number on the y-axis is `.3`? In fact, the height of the largest middle bars of the `barplot` for this case should be as high as `.4` which are not. Any way to fix that? – rnorouzian Jun 23 '17 at 01:03
  • The value of the bar height is determined by the integration value. So the value of the bar is roughly y * dx. If you want the value to be the same as the curve, then just divide by dx, which will be the average y-value over the interval. E.g. `integrate(dnorm, breaks[i], breaks[i+1])$value / (breaks[i+1] - breaks[i])` – thc Jun 23 '17 at 01:49
  • So, two things. First, the max of the y-axis doesn't become the max of the curve. Second, we can't fit the curve to the barplot, that is `curve(dnorm(x), -5, 5, n = 1e4, add = T)` still needs shifting. – rnorouzian Jun 23 '17 at 01:55
  • Just to give the bigger picture, imagine I obtain values of the density function called posterior: `prior = function(x) dnorm(x) ; likelihood = function(x) dt(1.46, 19, x*sqrt(20)) ; marginal = integrate(function(x) prior(x)*likelihood(x), -Inf, Inf)[[1]] ; posterior = function(x) prior(x)*likelihood(x) / marginal` – rnorouzian Jun 23 '17 at 01:58
  • I can usually plot the `posterior` this way: `plot(x <- seq(-5, 5, len = 1e3), posterior(x), col = 3, t = "l")` but really need to show this plot as a histogram? – rnorouzian Jun 23 '17 at 02:01
  • I don't see why the max bar height must be the maximum of the plot. The bar height is the average or integration of the density over an interval. Second, if you want to plot both density and histogram you should use ggplot2, not base R graphics. – thc Jun 23 '17 at 02:04
  • I see, thanks, but the max value on the y-axis could be as large as the one seen in the curve, right? Because, right now when I use `barplot(bar_y, names.arg=bar_x, ylim = c(0, max(c$y)) )` , the largest value on the y-axis is `.3` while the `max(c$y)` is about `.4`? – rnorouzian Jun 23 '17 at 02:17