
I'm trying to follow this example, but I can't seem to adapt it to my dataset since I need truncated normals: https://stackoverflow.com/questions/35990467/fit-two-gaussians-to-a-histogram-from-one-set-of-data-python#=

I have a dataset that is definitely a mixture of 2 truncated normals. The minimum value in the domain is 0 and the maximum is 1. I want to create an object that I can fit to optimize the parameters and get the likelihood of a sequence of numbers being drawn from that distribution. One option may be to just use the KDE model and use its pdf to get the likelihood. However, I want the exact means and standard deviations of the 2 distributions. I guess I could split the data in half and then model the 2 normals separately, but I also want to learn how to use optimize in SciPy. I'm just starting to experiment with this type of statistical analysis, so my apologies if this seems naive.

I'm not sure how to get a pdf this way that integrates to 1 and is constrained to the domain between 0 and 1.
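For reference, here is a minimal sketch (not from the original post) of a properly normalized two-component truncated-normal pdf on [0, 1] built with scipy.stats.truncnorm; the name mixture_pdf and all parameters are placeholders, not fitted values:

from scipy import stats

def mixture_pdf(x, mu1, sigma1, mu2, sigma2, w):
    # truncnorm takes its bounds in units of standard deviations from loc
    a1, b1 = (0 - mu1) / sigma1, (1 - mu1) / sigma1
    a2, b2 = (0 - mu2) / sigma2, (1 - mu2) / sigma2
    pdf1 = stats.truncnorm.pdf(x, a1, b1, loc=mu1, scale=sigma1)
    pdf2 = stats.truncnorm.pdf(x, a2, b2, loc=mu2, scale=sigma2)
    # a convex combination of two unit-area pdfs still integrates to 1 on [0, 1]
    return w * pdf1 + (1 - w) * pdf2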

import requests
from ast import literal_eval
from scipy import optimize, stats, integrate
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np


# Actual Data
u = np.asarray(literal_eval(requests.get("https://pastebin.com/raw/hP5VJ9vr").text))
# u.size ==> 6000
u.min(), u.max()
# (1.3628525454666037e-08, 0.99973136607553781)

# Distribution
with plt.style.context("seaborn-white"):
    fig, ax = plt.subplots()
    sns.kdeplot(u, color="black", ax=ax)
    ax.axvline(0, linestyle=":", color="red")
    ax.axvline(1, linestyle=":", color="red")
kde = stats.gaussian_kde(u)

[Figure: KDE of the data, with red dotted lines marking the domain bounds at 0 and 1]

# Mixture model: sum of two clipped Gaussians
def truncated_gaussian_lower(x, mu, sigma, A):
    return np.clip(A * np.exp(-(x - mu)**2 / (2 * sigma**2)), a_min=0, a_max=None)

def truncated_gaussian_upper(x, mu, sigma, A):
    return np.clip(A * np.exp(-(x - mu)**2 / (2 * sigma**2)), a_min=None, a_max=1)

def mixture_model(x, mu1, sigma1, A1, mu2, sigma2, A2):
    return truncated_gaussian_lower(x, mu1, sigma1, A1) + truncated_gaussian_upper(x, mu2, sigma2, A2)

# Initial estimates: mu, sigma, A for each component
estimates = [0.1, 1, 3,
             0.9, 1, 1]
params, cov = optimize.curve_fit(mixture_model, u, kde.pdf(u), p0=estimates)

# ---------------------------------------------------------------------------
# RuntimeError                              Traceback (most recent call last)
# <ipython-input-265-b2efb2ca0e0a> in <module>()
#      32 estimates= [0.1, 1, 3, 
#      33             0.9, 1, 1]
# ---> 34 params,cov= optimize.curve_fit(mixture_model,u,kde.pdf(u),estimates )

# /Users/mu/anaconda/lib/python3.6/site-packages/scipy/optimize/minpack.py in curve_fit(f, xdata, ydata, p0, sigma, absolute_sigma, check_finite, bounds, method, jac, **kwargs)
#     738         cost = np.sum(infodict['fvec'] ** 2)
#     739         if ier not in [1, 2, 3, 4]:
# --> 740             raise RuntimeError("Optimal parameters not found: " + errmsg)
#     741     else:
#     742         # Rename maxfev (leastsq) to max_nfev (least_squares), if specified.

# RuntimeError: Optimal parameters not found: Number of calls to function has reached maxfev = 1400.

In response to @Uvar's very helpful explanation below: I am trying to test the integral from 0 to 1 to see if it equals 1, but I'm getting 0.3. I think I'm missing a crucial step in logic:

# Mixture model without clipping, zeroed outside [0, 1]
def truncated_gaussian(x, mu, sigma, A):
    return A * np.exp(-(x - mu)**2 / (2 * sigma**2))

def mixture_model(x, mu1, sigma1, A1, mu2, sigma2, A2):
    if isinstance(x, np.ndarray):
        norm_probas = truncated_gaussian(x, mu1, sigma1, A1) + truncated_gaussian(x, mu2, sigma2, A2)
        mask_outside = (x < 0) | (x > 1)
        norm_probas[mask_outside] = 0  # zero the density outside the domain
        return norm_probas
    else:
        if (x < 0) or (x > 1):
            return 0
        # note: this scalar branch still calls the earlier *clipped* functions
        return truncated_gaussian_lower(x, mu1, sigma1, A1) + truncated_gaussian_upper(x, mu2, sigma2, A2)

kde = stats.gaussian_kde(u, bw_method=2e-2)

# Initial estimates: mu, sigma, A for each component
estimates = [0.1, 1, 3,
             0.9, 1, 1]
params, cov = optimize.curve_fit(mixture_model, u, kde.pdf(u) / integrate.quad(kde, 0, 1)[0], p0=estimates, maxfev=5000)
# params
# array([  9.89751700e-01,   1.92831695e-02,   7.84324114e+00,
#          3.73623345e-03,   1.07754038e-02,   3.79238972e+01])

# Compare the data KDE with the fitted model
x = np.linspace(0,1,1000)
with plt.style.context("seaborn-white"):
    fig, ax = plt.subplots()
    ax.plot(x, kde(x), color="black", label="Data")
    ax.plot(x, mixture_model(x, *params), color="red", label="Model")
    ax.legend()
# Integrating from 0 to 1
integrate.quad(lambda x: mixture_model(x, *params), 0,1)[0]
# 0.3026863969781809

[Figure: data KDE (black) compared with the fitted model (red) on [0, 1]]

  • I'm looking for a way to get a custom scipy distribution where the pdf integrates to 1 and is bounded between 0 and 1, with 2 peaks: one around 0 (with a higher density) and another around 1 (with a lower density). – O.rka Aug 08 '17 at 03:59
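(As a sketch of the "custom scipy distribution" asked for in the comment above: one option is to subclass scipy.stats.rv_continuous; the class name TruncMixture and all parameter values here are illustrative, not from the thread.)

from scipy import stats

class TruncMixture(stats.rv_continuous):
    # mixture of two truncated normals on [0, 1]; w weights the first component
    def _pdf(self, x, mu1, sigma1, mu2, sigma2, w):
        p1 = stats.truncnorm.pdf(x, (0 - mu1) / sigma1, (1 - mu1) / sigma1, loc=mu1, scale=sigma1)
        p2 = stats.truncnorm.pdf(x, (0 - mu2) / sigma2, (1 - mu2) / sigma2, loc=mu2, scale=sigma2)
        return w * p1 + (1 - w) * p2

mix = TruncMixture(a=0.0, b=1.0, name="truncmix")
mix.pdf(0.5, 0.05, 0.05, 0.95, 0.1, 0.7)  # evaluate the mixture pdf at 0.5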

1 Answer


It seems you are misspecifying the fitting procedure: you are trying to fit kde.pdf(u) while clipping each Gaussian at a half-bound.

foo = kde.pdf(u)

min(foo)
Out[329]: 0.22903365654960098

max(foo)
Out[330]: 4.0119283429320332

As you can see, the probability density function of u is not constrained to [0, 1]. As such, simply removing the clipping will result in an exact fit.

def truncated_gaussian_lower(x,mu,sigma,A):
    return A*np.exp((-(x-mu)**2)/(2*sigma**2))
def truncated_gaussian_upper(x,mu,sigma,A):
    return A * np.exp((-(x-mu)**2)/(2*sigma**2))
def mixture_model(x,mu1,sigma1,A1,mu2,sigma2,A2):
    return truncated_gaussian_lower(x,mu1,sigma1,A1) + truncated_gaussian_upper(x,mu2,sigma2,A2)

estimates = [0.15, 1, 3,
             0.95, 1, 1]
params, cov = optimize.curve_fit(f=mixture_model, xdata=u, ydata=kde.pdf(u), p0=estimates)

params
Out[327]: 
array([ 0.00672248,  0.07462657,  4.01188383,  0.98006841,  0.07654998,
        1.30569665])

y3 = mixture_model(u, *params)

plt.plot(kde.pdf(u)+0.1) #add offset for visual inspection purpose

plt.plot(y3)

[Figure: perfect overlap of the model with kde.pdf(u), made visible by the +0.1 offset]

So, let's now say I change what I am plotting to:

plt.figure(); plt.plot(u,y3,'.')

[Figure: scatter of y3 against u, which does look like what you are trying to achieve]

Because, indeed:

np.allclose(y3, kde(u), atol=1e-2)
>>True

You can edit the mixture model a bit to be 0 outside of the domain [0, 1]:

def mixture_model(x,mu1,sigma1,A1,mu2,sigma2,A2):
    if (x < 0) or (x > 1):
        return 0
    return truncated_gaussian_lower(x,mu1,sigma1,A1) + truncated_gaussian_upper(x,mu2,sigma2,A2)

Doing so, however, will lose the option of directly evaluating the function over an array of x, so for the sake of argument I will leave it out for now.
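(A vectorized alternative, as a sketch: np.where zeroes the density outside the domain without losing array evaluation; the name mixture_model_masked is mine, not from the thread.)

def mixture_model_masked(x, mu1, sigma1, A1, mu2, sigma2, A2):
    x = np.asarray(x, dtype=float)
    y = truncated_gaussian_lower(x, mu1, sigma1, A1) + truncated_gaussian_upper(x, mu2, sigma2, A2)
    # keep the density inside [0, 1], zero it elsewhere
    return np.where((x < 0) | (x > 1), 0.0, y)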

Anyway, we want our integral to sum to 1 over the domain [0, 1], and one way to do this (feel free to play around with the bandwidth estimator in stats.gaussian_kde as well) is to divide the probability density estimate by its integral over the domain. Take care that optimize.curve_fit allows only 1400 function evaluations by default in this implementation, so the initial parameter estimates matter.

from scipy import integrate
sum_prob = integrate.quad(kde, 0 , 1)[0]
y = kde(u)/sum_prob
# Estimates: mu sigma A
estimates = [0.15, 1, 5,
             0.95, 0.5, 3]
params, cov = optimize.curve_fit(f=mixture_model, xdata=u, ydata=y, p0=estimates)
>>array([  6.72247814e-03,   7.46265651e-02,   7.23699661e+00,
     9.80068414e-01,   7.65499825e-02,   2.35533297e+00])

y3 = mixture_model(np.arange(0, 1, 0.001), *params)

with plt.style.context("seaborn-white"):
    fig, ax = plt.subplots()
    sns.kdeplot(u, color="black", ax=ax)
    ax.axvline(0, linestyle=":", color="red")
    ax.axvline(1, linestyle=":", color="red")
    plt.plot(np.arange(0,1,0.001), y3) #The red line is now your custom pdf with area-under-curve = 0.998 in the domain..

[Figure: data KDE with the fitted mixture pdf (red) overlaid across the full domain]

To check the area under the curve, I used this hacky solution of redefining mixture_model:

def mixture_model(x):
    mu1, sigma1, A1, mu2, sigma2, A2 = params
    return truncated_gaussian_lower(x,mu1,sigma1,A1) + truncated_gaussian_upper(x,mu2,sigma2,A2)

from scipy import integrate
integrated_value, error = integrate.quad(mixture_model, 0, 1) #0 lower bound, 1 upper bound
>>(0.9978588016186962, 5.222293368393178e-14)

Or doing the integral a second way:

import sympy
x = sympy.symbols('x', real=True, nonnegative=True)
foo = sympy.integrate(params[2]*sympy.exp((-(x-params[0])**2)/(2*params[1]**2))+params[5]*sympy.exp((-(x-params[3])**2)/(2*params[4]**2)),(x,0,1), manual=True)
foo.doit()

>>0.562981541724715*sqrt(pi) #this evaluates to 0.9978588016186956

And actually doing it your way as described in your edited question:

def mixture_model(x,mu1,sigma1,A1,mu2,sigma2,A2):
    return truncated_gaussian_lower(x,mu1,sigma1,A1) + truncated_gaussian_upper(x,mu2,sigma2,A2)
integrate.quad(lambda x: mixture_model(x, *params), 0,1)[0]
>>0.9978588016186962
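(And a third check, as a sketch: the area also has a closed form via the error function; the helper name gaussian_integral is mine, not from the thread.)

from scipy.special import erf

def gaussian_integral(mu, sigma, A, a, b):
    # integral of A*exp(-(x-mu)**2 / (2*sigma**2)) from a to b
    s = sigma * np.sqrt(2)
    return A * sigma * np.sqrt(np.pi / 2) * (erf((b - mu) / s) - erf((a - mu) / s))

area = (gaussian_integral(params[0], params[1], params[2], 0, 1)
        + gaussian_integral(params[3], params[4], params[5], 0, 1))
# area should again come out to ~0.99786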

If I set my bandwidth to your level (2e-2), the evaluation indeed comes down to 0.92, which is a worse result than the 0.998 we had earlier, but still significantly different from the 0.3 you report, which is something I cannot recreate even while copying your code snippets. Do you perhaps accidentally redefine functions/variables somewhere?
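(Finally, since the question also asks for likelihoods: one sketch, assuming the fitted params above, is to renormalize the mixture by its own area so it integrates to exactly 1 and then sum log densities; normalized_pdf and log_likelihood are illustrative names, not from the thread.)

Z = integrate.quad(lambda x: mixture_model(x, *params), 0, 1)[0]

def normalized_pdf(x):
    # dividing by the area makes the pdf integrate to exactly 1 on [0, 1]
    return mixture_model(x, *params) / Z

def log_likelihood(samples):
    # log-likelihood of a sequence of draws under the renormalized pdf
    return np.sum(np.log(normalized_pdf(np.asarray(samples))))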

  • Is there any way to model the mixture of truncated normals? Maybe the clips weren't the best way, but it looks like the means are correct here. If I have a vector of values, I'm wondering how I could get the likelihood while accounting for this pdf's domain being the range 0 to 1. – O.rka Aug 07 '17 at 15:53
  • I do not understand your question here. As you can see, the u values are in the range [0, 1]; however, pdf values do not need to be in this range, it is probability DENSITY, not probability. The integral will evaluate to approximately 1, since the probability over the whole domain should be 1. And, from the parameter optimization, you have your 2 Gaussians as fit. What else is left? Is it that you are missing the offset I added to visualize that the shapes are corresponding? If I remove the offset, you will see 1 line, with the other hidden underneath.. – Uvar Aug 07 '17 at 16:04
  • Thanks for this and the explanation! Let me digest this and get back to you. One thing that confuses me is that a pdf should integrate to 1 within its domain, right? If I calculate `kde.pdf(1.1)` I get `0.37351788`. Shouldn't this value be 0? – O.rka Aug 07 '17 at 16:41
  • In principle this is true, except for the fact that `stats.gaussian_kde` is not an exact representation of your probability density function, it is an estimate heavily reliant on selection of the bandwidth parameter and returns a smooth curve which exists outside the domain `[0, 1]` as well. In fact, in the code example provided, simple numerical integration will give a total "area" of `0.55` – Uvar Aug 08 '17 at 06:58
  • Edited answer. Now if you want mixture_model(1.1, *args) = 0, then you can add an if-clause in mixture_model to return 0 if x is outside of the target domain. Another option is to precompute a mask, e.g. `mask = (x < 0) | (x > 1); y3[mask] = 0` – Uvar Aug 08 '17 at 07:24
  • thanks for the edits. I tried implementing your solution above and I feel like I'm almost there but missing a step in logic. I tried integrating the final solution from `0 - 1` but ended up with a value of `0.3`. – O.rka Aug 08 '17 at 19:37
  • @O.rka I edited once more; the code snippet you have pasted works well for me. There might be some hidden redefinition of functions/variables in your actual script? – Uvar Aug 09 '17 at 09:36
  • When I ran `sum_prob = integrate.quad(kde, 0, 1)[0]; y = kde(u)/sum_prob; estimates = [0.15, 1, 5, 0.95, 0.5, 3]; params, cov = optimize.curve_fit(f=mixture_model, xdata=u, ydata=y, p0=estimates)` I got a `maxfev` error. When I exchanged `u` for `x = np.linspace(0,1,1000)` it worked. Is this what you did, or did you use `u`? – O.rka Aug 09 '17 at 22:57
  • I can't thank you enough for dealing with my dumb questions. I have been trying to figure out how to do Bayesian-style inference in SciPy instead of PyMC3. I have also been trying to figure out how to use the optimize module in SciPy. I'm still a little confused about how it's actually doing the curve fitting, but I can read more about it. Thanks again for being patient. – O.rka Aug 09 '17 at 23:19
  • I'm stuck on one more part but made a new question. Just linking you to it in case it's a quick and obvious fix for you and your expertise. https://stackoverflow.com/questions/45623440/how-to-convert-part-of-a-kernel-density-curve-into-a-probability-distribution-fu – O.rka Aug 10 '17 at 21:17