
Who can explain this to me? If I run the following

repet <- 10000
size <- 100
p <- .5
data <- (rbinom(repet, size, p) - size * p) / sqrt(size * p * (1-p))
hist(data, freq = FALSE)
x = seq(min(data) - 1, max(data) + 1, .01)
lines(x, dnorm(x), col='green', lwd = 4)

then I get reasonable agreement of the histogram and the theoretical density (due to the Central Limit Theorem).

[histogram with default breaks, overlaid with the theoretical normal density]

If I use

hist(data, breaks = 100, freq = FALSE)

the histogram is significantly different from the theoretical density.

[histogram with breaks = 100, overlaid with the theoretical normal density]

This change in behavior happens when I increase the number of breaks from 51 to 52. Why does it happen?

MrFlick
  • Funny thing is, I get the same random-looking histogram when I run your code verbatim (every time). As you can see by playing with the value of `breaks`, the data is fine but the histogram can misrepresent it badly. – Brent Bradburn Oct 18 '19 at 05:12
  • Based on the notes about `breaks` in my answer, this fixes the histogram in the question's code: `D = sqrt(size * p * (1-p)); hist(data, freq = FALSE, breaks=seq(-size*p/D,+size*(1-p)/D+0.5,1/D))` – Brent Bradburn Dec 04 '22 at 06:47

2 Answers


It has to do with the fact that the data you are generating from rbinom isn't continuous; it's discrete. There are only ~35 distinct values in there (as length(unique(data)) shows, with set.seed(15)). When you force the histogram to have 100 breaks, you find that many of those bins end up being empty:

sum(hist(data, breaks = 100, freq = FALSE)$counts==0)
# [1] 36

So if you look closely, the second histogram has a bar, then a space (for a bar with height 0), repeating. The total area under the bars has to be the same for both histograms, but because the bars in the second plot are half as wide, they need to be twice as tall.
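
One way to see that bookkeeping (assuming the same data object as in the question) is to compute both histograms without plotting them: the total area under the bars comes out to 1 in both cases, so halving the bin width forces the non-empty bars to be roughly twice as tall.

h_coarse <- hist(data, plot = FALSE)                # default breakpoints
h_fine   <- hist(data, breaks = 100, plot = FALSE)  # ~100 requested cells
sum(h_coarse$density * diff(h_coarse$breaks))       # total area under the bars: 1
sum(h_fine$density * diff(h_fine$breaks))           # total area under the bars: 1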

The point of all of this is to be careful when using histograms with discrete data; they are intended for continuous data. Also, the number of bins you choose can make a big difference in interpretation. If you change the defaults, you should have a very good reason to do so.
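
For discrete data like this, one alternative (a minimal sketch, assuming the question's data) is to plot the relative frequency of each distinct value directly, which sidesteps the binning question entirely:

freqs <- prop.table(table(data))  # relative frequency of each distinct standardized value
plot(freqs, xlab = "standardized value", ylab = "relative frequency")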

MrFlick

Look at the values in data -- they fall on an evenly spaced grid, 1/sqrt(size * p * (1-p)) = 0.2 units apart. Therefore, if you have too many bins, some of the bins will fall between the data points and have a zero hit count. The others will have a correspondingly higher density.
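
A quick check of that grid, assuming the same data and parameters as in the question:

vals <- sort(unique(data))
length(vals)                   # only a few dozen distinct values
unique(round(diff(vals), 10))  # every gap is a multiple of 1/sqrt(size * p * (1-p)) = 0.2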

In your experiments, there is a discontinuous effect because a single-number breaks argument, in the words of ?hist, ...

is a suggestion only; the breakpoints will be set to pretty values

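You can watch that "suggestion" in action (again assuming the question's data) by comparing the number of cells you ask for with the number you actually get:

## number of cells actually produced for each requested value of 'breaks'
sapply(c(51, 52, 100), function(k) length(hist(data, breaks = k, plot = FALSE)$breaks) - 1)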

You can override the arbitrary behavior of breaks by precisely specifying the breaks with a vector. I demonstrate that below, along with a more direct (integer-based) histogram of the binomial results:

probability <- 0.5  ## probability of success per trial
trials <- 14        ## number of trials per result
reps <- 1e6         ## number of results to generate (data size)

## generate histogram of random binomial results
data <- rbinom(reps, trials, probability)
offset <- 0.5  ## center histogram bins around integer data values
window <- seq(0 - offset, trials + offset)  ## create the vector of 'breaks'
hist(data, breaks = window)

## demonstrate the central limit theorem with a predictive curve over the histogram
population_variance <- probability * (1 - probability)  ## from model of Bernoulli trials
prediction_variance <- population_variance / trials
y <- dnorm(seq(0, 1, 0.01), probability, sqrt(prediction_variance))
lines(seq(0, 1, 0.01) * trials, y * reps / trials, col = 'green', lwd = 4)

[corrected chart: histogram of binomial counts with the predicted normal curve overlaid]


Regarding the first chart shown in the question: Using repet <- 10000, the histogram should be very close to normal (the "Law of large numbers" results in convergence), and running the same experiment repeatedly (or further increasing repet) doesn't change the shape much -- despite the explicit randomness. The apparent randomness in the first chart is also an artifact of the plotting bug in question. To put it more plainly: both charts shown in the question are very wrong (because of breaks).
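
For reference, here is one way to correct the question's charts along the lines of the comment above: give hist() a breaks vector aligned with the discrete grid of the standardized values. This is a sketch that reuses the question's variables (data, size, p, x) and centers one bin on each possible value:

D <- sqrt(size * p * (1 - p))  # the standardized values are spaced 1/D = 0.2 apart
grid_breaks <- seq(min(data) - 1/(2*D), max(data) + 1/(2*D), by = 1/D)
hist(data, freq = FALSE, breaks = grid_breaks)  # one bin centered on each possible value
lines(x, dnorm(x), col = 'green', lwd = 4)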

Brent Bradburn
  • Using `repet <- 10000` is OK for normalizing 14 trials, but `1e6` seems like a sweet spot. – Brent Bradburn Feb 16 '20 at 22:56
  • Note to self: It's really easy to adjust the parameters and regenerate the plots using RStudio. It's not exactly intuitive, but you start with "Run the current selection" (after selecting the entire script). After that, use "Re-run the previous code region" for rapid turnaround. – Brent Bradburn Dec 04 '22 at 02:25