
I promise this is not just another dice-rolling homework problem. I implemented a function to calculate the probability of obtaining a sum less than s when rolling n m-sided dice. The function works for small values of n, but I am getting weird results for large values of n. See the attached plot. Anyone have insight into what is going on?

My probability function

Implemented from this Math Stack Exchange post:

# P(sum of n rolls of an m-sided die is less than s),
# computed with the inclusion-exclusion formula from the linked post
probability <- function(s, m, n) {
  i <- 0:((s - 1 - n) / m)
  m^(-n) * sum((-1)^i * choose(n, i) * choose(s - 1 - i * m, n))
}
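Written out (my transcription of the code, which should match the linked derivation), the quantity being computed is

$$P(\text{sum} < s) \;=\; m^{-n} \sum_{i=0}^{\lfloor (s-1-n)/m \rfloor} (-1)^{i} \binom{n}{i} \binom{s-1-im}{n}.$$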

It starts breaking down at around n > 80:

n <- 1:90 # number of dice
m <- 6 # number of sides
s <- floor(mean(1:m)*n) # sum of faces
p <- mapply(probability, s = s, m = m, n = n)
plot(n, p, main = paste("probability of rolling less than floor(", mean(1:m),"* n) as sum of n rolls of a", m, "sided dice"))

[Plot: computed probability p against n; the values become erratic for large n]

Lief Esbenshade

Comments:

  • To add insult to injury, my friend has implemented the same algorithm in Mathematica and is not having any problems with large values of n. – Lief Esbenshade Apr 28 '20 at 20:32
  • Those `choose` numbers become huge. For example `choose(80,40)`. Your formula isn't numerically stable. Maybe try calculating things on the log scale. – MrFlick Apr 28 '20 at 20:33
  • For large `n`, `choose` will completely lose precision. Maybe you can read https://stackoverflow.com/a/40527881/12158757 for alternatives. – ThomasIsCoding Apr 28 '20 at 20:58
  • Thank you, I'm working on implementing Ramanujan's approximation, but am having trouble vectorizing the function and getting it to use `base::choose` for small values of `n` and `k`. I will edit the question to include my progress. – Lief Esbenshade Apr 28 '20 at 22:26
  • Fixed the NaN problem, but the precision is still breaking down for ~ n > 80. – Lief Esbenshade Apr 29 '20 at 00:33

2 Answers


As mentioned in the comments on the original question, the problem is that the probability function asks R to calculate some really huge numbers (choose(80, 40) ≈ 1.075072e+23), and we hit the numerical precision limits of R's double-precision arithmetic.

An alternative approach that doesn't involve huge numbers, but instead uses lots of random numbers, is Monte Carlo simulation: generate a distribution of dice-roll sums and compare the target sum to that distribution. It takes longer to run, but it is much easier to implement and does not have the numerical precision problems.

# Monte Carlo estimate of P(sum of n rolls of an m-sided die < s)
mc <- Vectorize(function(s, m, n, reps = 10000) {
  # simulate `reps` sums of n rolls
  x <- replicate(reps, sum(sample(m, n, replace = TRUE)))
  # proportion of simulated sums <= s - 1, i.e. strictly less than s
  ecdf(x)(s - 1)
})
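A note I'll add here (not part of the original answer): the Monte Carlo standard error shrinks roughly as 1 / sqrt(reps), so if the estimates look too noisy you can trade run time for precision by raising reps. Since the function is vectorised, it can also be called directly, e.g. for the largest case in the question:

# n = 90 dice, target sum s = floor(3.5 * 90) = 315, 10x the default repetitions
mc(s = 315, m = 6, n = 90, reps = 1e5)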



n <- 1:90 # number of dice
m <- 6 # number of sides
s <- floor(mean(1:m)*n) # sum of faces
analytic_prob <- mapply(probability, s = s, m = m, n = n)
mc_prob <- mapply(mc, s = s, m = m, n = n)


plot(n, analytic_prob, main = paste("probability of rolling less than floor(", mean(1:m),"* n) as sum of n rolls of a", m, "sided dice"),
     sub = "monte carlo in red")
points(n, mc_prob, col = "red")

[Plot: analytic probabilities in black, Monte Carlo estimates in red]

Lief Esbenshade

The problem is caused by R's numerical precision limits. As the commenters noted, the n-choose-k values computed above are really, really big (choose(80, 40) ≈ 1.075072e+23).

We can use logs to try to keep the problem within R's computational limits. Below is an implementation based on Ramanujan's approximation of log(n!). Unfortunately, the approximation errors compound and the accuracy decays even faster than with the original function: the probability formula adds and subtracts a sequence of very large numbers to arrive at a final value between 0 and 1, and it is not tolerant of any imprecision.
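To get a feel for the scale of the problem (a rough illustration I am adding here, not from the original post): a double carries only about 15-16 significant digits, while the alternating series contains terms with dozens of digits, so most of the significance is lost to cancellation.

choose(80, 40)       # ~1.075e+23, far beyond the ~9e+15 limit for exactly representable integers
.Machine$double.eps  # ~2.2e-16, the relative precision of double-precision arithmetic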

0) Rewrite the probability function so the intermediate steps are explicit

probability <- function(s, m, n) {
  # Probability of getting less than s
  i <- 0:((s - 1 - n) / m)

  c1 <- choose(n, i)
  c2 <- choose(s - 1 - i * m, n)

  seq <- (-1)^i * (c1 * c2)

  m^(-n) * sum(seq)
}
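As a quick sanity check on a case small enough that nothing overflows (my own addition; the exact value comes from enumerating the 216 outcomes of three six-sided dice, 81 of which sum to less than 10):

probability(s = 10, m = 6, n = 3)  # exact answer is 81/216 = 0.375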

1) Implement an approximation of log(n!)

# Ramanujan's approximation of log(n!)
ramanujan <- function(n){
  n * log(n) - n + log(n * (1 + 4*n * (1 + 2*n))) / 6 + log(pi) / 2
}

# confirm the approximation works correctly
n <- 1:200
diff <- log(factorial(n)) - ramanujan(n)
plot(n, diff) # R returns Inf for factorial(171), but up to there the numbers match
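As an aside (not in the original post), base R already provides an exact log-factorial, `lfactorial()`, which keeps working well past n = 171 and makes a convenient reference for the approximation:

n <- 1:200
max(abs(lfactorial(n) - ramanujan(n)))  # largest at n = 1, on the order of 1e-4, shrinking rapidly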

2) Rewrite the choose function using the log approximation

#' Returns log(choose(n, k))
log_nck <- Vectorize(function(n, k) {
  # fall back to the exact value when ramanujan() would be called with a non-positive argument
  if (n <= k | n < 1 | k < 1) return(log(choose(n, k)))

  return(ramanujan(n) - ramanujan(k) - ramanujan(n - k))
})

# Check that choose function works
n <- seq(10, 100, 10)
k <- seq(5, 50, 5)
c_real <- log(choose(n, k))
c_approx <- log_nck(n, k)
# If we print them, they appear to match
print(c_real)
print(c_approx)
# and the difference shows pretty small errors. 
print(c_real - c_approx)
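Another aside I am adding (not part of the original post): base R also has `lchoose()`, which computes log(choose(n, k)) directly and accurately, so it gives another reference point for `log_nck`:

n <- seq(10, 100, 10)
k <- seq(5, 50, 5)
max(abs(lchoose(n, k) - log_nck(n, k)))  # should again be small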

3) Rewrite the probability function using the log choose function

new_probability <- function(s, m, n) {
  # Probability of getting less than s
  i <- 0:((s - 1 - n) / m)

  c1 <- log_nck(n, i)
  c2 <- log_nck(s - 1 - i * m, n)

  # exponentiate the summed log binomial coefficients, then apply the alternating signs
  seq <- (-1)^i * exp(c1 + c2)

  return(m^(-n) * sum(seq))
}
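Before the full test, a quick check (my own addition) that the log version agrees with the exact formula on the small case used above:

probability(s = 10, m = 6, n = 3)      # exact: 81/216 = 0.375
new_probability(s = 10, m = 6, n = 3)  # should be very close to 0.375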

Final Testing

n <- 1:90 # number of dice
m <- 6 # number of sides
s <- floor(mean(1:m)*n) # sum of faces

p <- mapply(probability, s = s, m = m, n = n)
newp <- mapply(new_probability, s = s, m = m, n = n)

plot(n, p, main = "Original in black, approximation in red")
points(n, newp, col = "red")

[Plot: original formula in black, log-approximation version in red; both break down at large n]

Lief Esbenshade