4

I would like to pull 1000 samples from a custom distribution in R

I have the following custom distribution

library(gamlss)
mu    <- 1    
sigma <- 2 
tau   <- 3   
kappa <- 3
rate  <- 1
Rmax  <- 20

x <- seq(1, 2e1, 0.01)
points <- Rmax * dexGAUS(x, mu = mu, sigma = sigma, nu = tau) * pgamma(x, shape = kappa, rate = rate)
plot(points ~ x)

enter image description here

How can I randomly sample via Monte Carlo simulation from this distribution?

My first attempt was the following code which produced a histogram shape I did not expect.

hist(sample(points, 1000), breaks = 51)

enter image description here

This is not what I was looking for as it does not follow the same distribution as the pdf.

Martin Schmelzer
  • 23,283
  • 6
  • 73
  • 98
Alex
  • 2,603
  • 4
  • 40
  • 73

4 Answers4

3

If you want a Monte Carlo simulation, you'll need to sample from the distribution a large number of times, not take a large sample one time.

Your object, points, has values that increases as the index increases to a threshold around 400, levels off, and then decreases. That's what plot(points ~ x) shows. It may describe a distribution, but the actual distribution of values in points is different. That shows how often values are within a certain range. You'll notice your x axis for the histogram is similar to the y axis for the plot(points ~ x) plot. The actual distribution of values in the points object is easy enough to see, and it is similar to what you're seeing when sampling 1000 values at random, without replacement from an object with 1900 values in it. Here's the distribution of values in points (no simulation required):

hist(points, 100)

I used 100 breaks on purpose so you could see some of the fine details.

enter image description here

Notice the little bump in the tail at the top, that you may not be expecting if you want the histogram to look like the plot of the values vs. the index (or some increasing x). That means that there are more values in points that are around 2 then there are around 1. See if you can look at how the curve of plot(points ~ x) flattens when the value is around 2, and how it's very steep between 0.5 and 1.5. Notice also the large hump at the low end of the histogram, and look at the plot(points ~ x) curve again. Do you see how most of the values (whether they're at the low end or the high end of that curve) are close to 0, or at least less than 0.25. If you look at those details, you may be able to convince yourself that the histogram is, in fact, exactly what you should expect :)

If you want a Monte Carlo simulation of a sample from this object, you might try something like:

samples <- replicate(1000, sample(points, 100, replace = TRUE))

If you want to generate data using points as a probability density function, that question has been asked and answered here

De Novo
  • 7,120
  • 1
  • 23
  • 39
  • Your explanation is answering my question correctly. I see how the histogram of `points` is what I would expect. Thanks for the great answer. I now realize that it is not the histogram of `points` that I want to simulate, but the histogram of `plot(points~x)` – Alex Mar 20 '18 at 23:58
  • Correct me if I'm wrong, but this answer only explains what `hist(sample(points, 1000), breaks = 51)` is rather than answers to the actual question---how to draw a sample from a given distribution. – Julius Vainora Mar 21 '18 at 00:16
  • @Julius You are not wrong. This an answer to what actually happened with the OPs approach to answering the question. The answer to what is probably the question behind it (how do I generate data from a probability distribution), is [here](https://stackoverflow.com/questions/32871602/r-generate-data-from-a-probability-density-distribution) – De Novo Mar 21 '18 at 00:28
  • It's just interesting that an answer that actually could be flagged as "not an answer" since it's only about the failed attempt rather than the actual question is the most upvoted one and accepted. But if that's what the OP was after, then great. – Julius Vainora Mar 21 '18 at 00:35
  • @Julius If OP wasn't asking for help with the attempt, and was just asking [this](https://stackoverflow.com/questions/32871602/r-generate-data-from-a-probability-density-distribution), then it should have been flagged as a duplicate. The way the problem was approached showed the OP could benefit from an explanation of the difference between the plot of an object's values and the plot of an object's distribution. – De Novo Mar 21 '18 at 00:39
  • Right, it could have been closed as a duplicate, except that in this case an expression for the pdf is given and only data is available in the linked question. Your answer is certainly useful and is too long for a comment; what I found surprising is that it was the best regardless of the question title and zero question marks around the failed attempt description. – Julius Vainora Mar 21 '18 at 01:02
2

Let's define your (not normalized) probability density function as a function:

library(gamlss)
fun <- function(x, mu = 1, sigma = 2, tau = 3, kappa = 3, rate = 1, Rmax = 20)
  Rmax * dexGAUS(x, mu = mu, sigma = sigma, nu = tau) * 
  pgamma(x, shape = kappa, rate = rate)

Now one approach is to use some MCMC (Markov chain Monte Carlo) method. For instance,

simMCMC <- function(N, init, fun, ...) {
  out <- numeric(N)
  out[1] <- init
  for(i in 2:N) {
    pr <- out[i - 1] + rnorm(1, ...)
    r <- fun(pr) / fun(out[i - 1])
    out[i] <- ifelse(runif(1) < r, pr, out[i - 1])
  }
  out
}

It starts from point init and gives N draws. The approach can be improved in many ways, but I'm simply only going to start form init = 5, include a burnin period of 20000 and to select every second draw to reduce the number of repetitions:

d <- tail(simMCMC(20000 + 2000, init = 5, fun = fun), 2000)[c(TRUE, FALSE)]
plot(density(d))

enter image description here

Julius Vainora
  • 47,421
  • 9
  • 90
  • 102
1

You invert the ECDF of the distribution:

 ecd.points <- ecdf(points)
 invecdfpts <- with( environment(ecd.points), approxfun(y,x) )
 samp.inv.ecd <- function(n=100) invecdfpts( runif(n) )
 plot(density (samp.inv.ecd(100) ) )
 plot(density(points) )
 png(); layout(matrix(1:2,1)); plot(density (samp.inv.ecd(100) ),main="The Sample" )
  plot(density(points) , main="The Original"); dev.off()

enter image description here

IRTFM
  • 258,963
  • 21
  • 364
  • 487
0

Here's another way to do it that draws from R: Generate data from a probability density distribution and How to create a distribution function in R?:

x <- seq(1, 2e1, 0.01)
points <- 20*dexGAUS(x,mu=1,sigma=2,nu=3)*pgamma(x,shape=3,rate=1)
f <- function (x) (20*dexGAUS(x,mu=1,sigma=2,nu=3)*pgamma(x,shape=3,rate=1))
C <- integrate(f,-Inf,Inf)

> C$value
[1] 11.50361

# normalize by C$value
f <- function (x) 
(20*dexGAUS(x,mu=1,sigma=2,nu=3)*pgamma(x,shape=3,rate=1)/11.50361)

random.points <- approx(cumsum(pdf$y)/sum(pdf$y),pdf$x,runif(10000))$y
hist(random.points,1000)

histfrommydist

hist((random.points*40),1000) will get the scaling like your original function.

mysteRious
  • 4,102
  • 2
  • 16
  • 36