12

Say I have a simple array, with a corresponding probability distribution.

library(stats)    
data <- c(0,0.08,0.15,0.28,0.90)
pdf_of_data <- density(data, from= 0, to=1, bw=0.1)

Is there a way I could generate another set of data using the same distribution. As the operation is probabilistic, it need not exactly match the initial distribution anymore, but will be just generated from it.

I did have success finding a simple solution on my own. Thanks!

puslet88
  • 1,288
  • 15
  • 25

3 Answers3

11

Your best bet is to generate the empirical cumulative density function, approximate the inverse, and then transform the input.

The compound expression looks like

random.points <- approx(
  cumsum(pdf_of_data$y)/sum(pdf_of_data$y),
  pdf_of_data$x,
  runif(10000)
)$y

Yields

hist(random.points, 100)

enter image description here

user295691
  • 7,108
  • 1
  • 26
  • 35
  • This is great! Thank you! – puslet88 Oct 03 '15 at 12:24
  • 2
    This is a fantastic answer! I'm going to add it to the `stackoverflow` package. – Neal Fultz Oct 10 '15 at 18:34
  • You might want to look at the `ecdf` function. It does much of that for you. – IRTFM Mar 22 '18 at 20:51
  • 1
    @42- I don't see how the `ecdf` function helps in this case. We have a pdf, which we can integrate (with `cumsum`) to get a cdf, which we invert to get a quantile function, which we can then draw from. `ecdf` generates a cdf from data drawn from a distribution. – user295691 Mar 23 '18 at 14:20
8

From the examples in the documentation of ?density you (almost) get the answer.

So, something like this should do it:

library("stats")    
data <- c(0,0.08,0.15,0.28,0.90)
pdf_of_data <- density(data, from= 0, to=1, bw=0.1)

# From the example.
N <- 1e6
x.new <- rnorm(N, sample(data, size = N, replace = TRUE), pdf_of_data$bw)

# Histogram of the draws with the distribution superimposed.
hist(x.new, freq = FALSE)
lines(pdf_of_data)

Imgur

You can just reject the draws outside your interval as in rejection sampling. Alternatively, you can use the algorithm described in the link.

Anders Ellern Bilgrau
  • 9,928
  • 1
  • 30
  • 37
  • 3
    Isn't this just adding gaussian noise to the data, not drawing from the smoothed density curve? – Neal Fultz Sep 30 '15 at 16:58
  • 1
    @NealFultz Remember that the smoothed density curve is a mixture of Gaussian distributions---one for each observation. So we're just sampling directly from each component. But yes, you're also right. – Anders Ellern Bilgrau Sep 30 '15 at 17:01
  • @NealFultz Right. But I guess it's easily modified by using the proper function instead of `rnorm`. – Anders Ellern Bilgrau Sep 30 '15 at 17:05
  • Thanks, this is really great! I guess rejection sampling would look like this: `expected_sample_size <- 2; new_sample <- list(); while(length(new_sample) < expected_sample_size ){ new_item <- rnorm(1, sample(data, size = 1, replace = TRUE), data_pdf$bw); if(new_item >0 & new_item < 1){ new_sample[length(new_sample)+1] <- new_item} } ` – puslet88 Oct 03 '15 at 11:45
  • With this approach there is one limitation though in that if the original distribution predicts 0 chance for some number, the new sample might still contain it as the variation is added after the probability density distribution is used. It seems to me. – puslet88 Oct 03 '15 at 11:49
4

To draw from the curve:

sample(pdf_of_data$x, 1e6, TRUE, pdf_of_data$y)
Neal Fultz
  • 9,282
  • 1
  • 39
  • 60
  • 4
    Note, that this approach is strongly dependent on the `n` argument in `density`. You'll never get more than `n` unique values. You're just sampling from a discrete probability function with `n` values and corresponding probabilities. Depending on your application, this might be just fine. – Anders Ellern Bilgrau Sep 30 '15 at 17:07
  • That's a good point, and `n` might be manipulated to fit the data. For my purposes this actually works too. Too many correct and useful answers to choose from here. Many thanks again! – puslet88 Oct 03 '15 at 12:28