
I have done some searching, but I cannot seem to find a reasonable way to sample from a truncated normal distribution.

Without truncation I was doing:

samples = [np.random.normal(loc=x,scale=d) for (x,d) in zip(X,D)]

`X` and `D` being lists of floats.

Currently I am implementing truncation like this:

def truncnorm(loc,scale,bounds):
    s = np.random.normal(loc,scale)
    if s > bounds[1]:
        return bounds[1]
    elif s < bounds[0]:
        return bounds[0]
    return s

samples = [truncnorm(loc=x,scale=d,bounds=b) for (x,d,b) in zip(X,D,bounds)]

`bounds` being a list of `(min, max)` tuples.

This approach feels a little awkward, so I'm wondering if there is a better way?

  • Well, what you've shown there isn't what is usually called a truncated normal distribution. Instead it's a mixture of a truncated normal with a discrete distribution which has a point mass at bounds[0] and another at bounds[1]. To get a truncated normal, you want `if s > bounds[1]: return truncnorm(loc, scale, bounds)` i.e., if you fall outside, try again. Depending on the bounds, you might have to try many times. If the majority of the mass is between the bounds, trying again is OK. But if the bounds might exclude most of the mass, you might want a different approach. – Robert Dodier Oct 30 '20 at 01:48
  • A different approach would be to invert the cdf, the cdf being a scaled version of the untruncated cdf (the scaling factor being equal to the mass between the bounds). That's going to be some expression involving erf. I don't know if inverse erf is among the functions available in numpy/scipy, but if not, you can invert it via bisection or something simple like that. – Robert Dodier Oct 30 '20 at 01:59 (A sketch of this approach is shown after these comments.)
  • Have you seen [`scipy.stats.truncnorm`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.truncnorm.html)? It isn't exactly the same as what you implemented, because (as @RobertDodier pointed out), your distribution has point masses at the bounds. The usual [truncated normal distribution](https://en.wikipedia.org/wiki/Truncated_normal_distribution) doesn't have those point masses at the bounds. – Warren Weckesser Oct 30 '20 at 04:04
  • @WarrenWeckesser I looked at the scipy function but I couldn't figure out how to set the range. – Jonathan Woollett-light Oct 30 '20 at 05:52
  • Dupe of [Python: Random number generator with mean and Standard Deviation](https://stackoverflow.com/q/27831923/1782792)? – jdehesa Oct 30 '20 at 09:24
  • @JohanC I'm not sure what I am missing here, but when getting a value from the distribution it frequently is outside the range `a->b`, please see: https://repl.it/join/bqjdwnqq-jonathanwoollet – Jonathan Woollett-light Oct 30 '20 at 23:14
  • @JohanC So that seems to work. Maybe add a little extra explanation, put that as the answer and I'll mark it right. – Jonathan Woollett-light Oct 31 '20 at 06:08
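
A minimal sketch of the inverse-CDF approach described in the comments, using scipy.stats.norm.cdf and norm.ppf for the erf-based calculations (the function name truncnorm_inverse_cdf and its signature are illustrative assumptions, not from the thread):

import numpy as np
from scipy.stats import norm

def truncnorm_inverse_cdf(loc, scale, bounds, size=None):
    # Illustrative sketch: mass of the untruncated normal below each bound.
    lo = norm.cdf(bounds[0], loc=loc, scale=scale)
    hi = norm.cdf(bounds[1], loc=loc, scale=scale)
    # Sample uniformly over the retained mass, then map back through the inverse cdf.
    u = np.random.uniform(lo, hi, size=size)
    return norm.ppf(u, loc=loc, scale=scale)

Unlike rejection sampling, this stays efficient even when the bounds exclude most of the distribution's mass.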

1 Answer


Returning the value of the bounds for samples that fall outside them results in too many samples landing exactly on the bounds, which is not representative of the actual distribution. Such samples need to be rejected and replaced by a new draw. The code could look like:

def test_truncnorm(loc, scale, bounds):
    # Rejection sampling: redraw until the sample falls inside the bounds.
    while True:
        s = np.random.normal(loc, scale)
        if bounds[0] <= s <= bounds[1]:
            break
    return s

This can be extremely slow for narrow bounds. SciPy's truncnorm handles such cases more efficiently. Somewhat surprisingly, the bounds are expressed in terms of the standard normal (i.e. in units of standard deviations from loc), so your call would be:

s = scipy.stats.truncnorm.rvs((bounds[0]-loc)/scale, (bounds[1]-loc)/scale, loc=loc, scale=scale)

Note that scipy works much faster when making use of numpy's vectorization and broadcasting. And once you're used to the notation, it also looks simpler to write and read. All samples can be calculated in one go as:

import numpy as np
import scipy.stats

X = np.array(X)
D = np.array(D)
bounds = np.array(bounds)
samples = scipy.stats.truncnorm.rvs((bounds[:, 0] - X) / D, (bounds[:, 1] - X) / D, loc=X, scale=D)
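
As a quick sanity check (a sketch, assuming X, D and bounds are the arrays built above), every sample should lie inside its own interval:

# Every sample should fall within its respective (min, max) bounds.
assert np.all((samples >= bounds[:, 0]) & (samples <= bounds[:, 1]))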
JohanC