
I have a complicated (non-standard) distribution function that I want to sample from in order to generate simulated data points, using the inverse-CDF technique. For the sake of this example I will consider a Gaussian distribution:

import numpy as np
from scipy import integrate

var = 100

def f(x, a):
    def g(y):
        # Gaussian PDF with mean 0 and variance var.
        return (1/np.sqrt(2*np.pi*var))*np.exp(-y**2/(2*var))
    b, err = integrate.quad(g, -np.inf, x)
    return b - a

I want to generate values of a in [0, 1), a = np.linspace(0, 1, 10000, endpoint=False), and use scipy.optimize.fsolve to solve for x for each a. I have two questions:

  1. How do I use fsolve for an array of values a?

  2. fsolve takes an initial guess x0; how do I pick a good guess value?
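
For the Gaussian test case, the inversion can be sanity-checked against the exact inverse CDF; a minimal sketch, assuming scipy.stats.norm.ppf with standard deviation sqrt(var) = 10:

import numpy as np
from scipy import stats

var = 100

# Target probabilities; drop a = 0, which maps to -inf.
a = np.linspace(0, 1, 10000, endpoint=False)[1:]

# Exact inverse CDF of a zero-mean Gaussian with variance var.
x_exact = stats.norm.ppf(a, loc=0, scale=np.sqrt(var))

# Any root-finding inversion of the CDF should reproduce these values.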

Thanks

HuShu
  • What is the purpose of `a` here? Are you trying to invert the cdf for each value in `a`? I don't think `fsolve` will work with an array this way and you'll probably need to call it 10000 times. – simonzack Dec 16 '15 at 08:36
  • With regard to 2 I'm not quite sure, but you can bound the solution based on the previous `a`, since the cdf is non-decreasing. – simonzack Dec 16 '15 at 08:39
  • @simonzack a is the value of the icdf; I subtract it off in the definition to let fsolve solve the equation. Is there any other numerical way to solve for the upper limit of the integral without using fsolve? – HuShu Dec 16 '15 at 08:45
  • I don't think this question is the place to address it, but for efficiency I'd suggest using an [ODE integrator](https://docs.scipy.org/doc/scipy-0.16.0/reference/generated/scipy.integrate.ode.html#scipy.integrate.ode) to integrate the PDF _once_, then put the resulting values into an interpolation object to create your inverse CDF. After that, evaluating the inverse CDF is just a function call, one which _is_ vectorized (it works on arrays... I think), so you can get your samples as easily as `icdf(np.random.uniform(size=10000))`. – David Z Dec 16 '15 at 10:07
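
A minimal sketch of the approach David Z describes above, assuming the same Gaussian PDF; the cumulative integral is built here with a trapezoidal sum on a fixed grid (rather than scipy.integrate.ode), and scipy.interpolate.interp1d provides the vectorized inverse CDF:

import numpy as np
from scipy.interpolate import interp1d

var = 100

# Evaluate the PDF once on a grid wide enough to hold essentially all the mass.
x = np.linspace(-50, 50, 10001)
pdf = (1/np.sqrt(2*np.pi*var)) * np.exp(-x**2/(2*var))

# Cumulative (trapezoidal) integral of the PDF, normalised to end at exactly 1.
cdf = np.concatenate(([0.0], np.cumsum((pdf[1:] + pdf[:-1]) / 2 * np.diff(x))))
cdf /= cdf[-1]

# Interpolating x as a function of the CDF value gives the inverse CDF.
icdf = interp1d(cdf, x)

# Vectorized sampling: one call for all 10000 points.
samples = icdf(np.random.uniform(size=10000))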

1 Answer


Here's how you do it. I replaced 10000 with 10, as it's going to take a while. My initial guess is just 0, and I set each subsequent guess to the previous solution, as it should be quite close. You can further bound this if you want, so that the search stays strictly above the previous solution.

As a side comment, this kind of sampling isn't really feasible for complicated distributions, as computing the CDF can be rather difficult. There are other sampling techniques that address this, such as Gibbs sampling and Metropolis-Hastings (see the sketch at the end of this answer).

import numpy as np
import scipy as sp
import scipy.integrate
import scipy.optimize

var = 100

def f(x, a):
    def g(y):
        # Gaussian PDF with mean 0 and variance var.
        return (1/np.sqrt(2*np.pi*var))*np.exp(-y**2/(2*var))

    # CDF(x) minus the target probability a; its root is the inverse CDF at a.
    b, err = sp.integrate.quad(g, -np.inf, x)
    return b - a


# Target probabilities; drop a = 0, whose solution would be -inf.
a = np.linspace(0, 1, 10, endpoint=False)[1:]
x0 = 0
for a_ in a:
    # Use the previous solution (plus a small offset) as the next initial guess.
    xi = sp.optimize.fsolve(f, x0 + 0.01, args=(a_,))[0]
    print(xi)
    x0 = xi

[EDIT] It seems to get stuck near 0; adding a small number to the initial guess fixes it. I'm not sure why, as I don't know how fsolve works internally.
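
A minimal sketch of the Metropolis-Hastings alternative mentioned above, assuming the same Gaussian target (which MH only needs up to a normalising constant) and a Gaussian random-walk proposal; the step size of 5 is an arbitrary choice:

import numpy as np

var = 100

def pdf(y):
    # Target density, needed only up to a normalising constant.
    return np.exp(-y**2/(2*var))

rng = np.random.default_rng(0)
n_samples = 10000
samples = np.empty(n_samples)

x = 0.0  # starting point
for i in range(n_samples):
    # Propose a move from a symmetric random-walk proposal.
    x_new = x + rng.normal(scale=5.0)
    # Accept with probability min(1, pdf(x_new) / pdf(x)).
    if rng.uniform() < pdf(x_new) / pdf(x):
        x = x_new
    samples[i] = x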

simonzack
  • I think [this question](https://stackoverflow.com/questions/28187569/find-the-root-of-a-cubic-function) touches on the reason it gets stuck. – David Z Dec 16 '15 at 09:51
  • @DavidZ That does make some sense. – simonzack Dec 16 '15 at 10:57
  • Thanks, your solution works; I chose the initial guess to be 1 instead of 0. As for your comment regarding using Metropolis-Hastings and other MCMC techniques, yes, I did try using emcee to sample the distribution, however when I used the simulated values to maximize the likelihood, my best-fit value was way off! Perhaps I will open a new post highlighting what I did and see if there is a way to improve the accuracy. – HuShu Dec 16 '15 at 16:11