
I am trying to fit double-exponential data (i.e. a mixture of two exponential distributions, or bi-exponential) using MLE. Although I found no direct example of such a problem, I did find hints on using MLE for linear (Maximum Likelihood Estimate pseudocode), sigmoidal (https://stats.stackexchange.com/questions/66199/maximum-likelihood-curve-model-fitting-in-python) and normal (Scipy MLE fit of a normal distribution) distribution fitting. Using these examples I have put together the following code:

import numpy as np
import matplotlib.pyplot as plt
from scipy import optimize
import scipy.stats as stats

size = 300

def simu_dt():
    ## simulate Exp2 data
    np.random.seed(0)
    ## generate uniform random values between 0 and 1
    x = np.random.rand(size)
    data = []
    for n in x:
        if n < 0.6:
            # generating 1st exp data
            data.append(np.random.exponential(scale=20)) # t1
        else:
            # generating 2nd exp data
            data.append(np.random.exponential(scale=500)) # t2
    return np.array(data)

ydata2 = simu_dt() # call to generate simulated data
## trimming the data at the beginning and the end a bit
ydata2 = ydata2[np.where(2 < ydata2)]
ydata2 = ydata2[np.where(ydata2 < 3000)]

## creating the normalized log binned histogram data
bins = 10 ** np.linspace(np.log10(np.min(ydata2)), np.log10(np.max(ydata2)), 10)
counts, bin_edges = np.histogram(ydata2, bins=bins)
bin_centres = (bin_edges[:-1] + bin_edges[1:]) / 2
bin_width = (bin_edges[1:] - bin_edges[:-1])
counts = counts / bin_width / np.sum(counts)

## generating arbitrary x value
x1 = np.linspace(bin_centres.min(), bin_centres.max(), len(ydata2))

def MLE(params):
    """ find the max likelihood """
    a1, k1, k2, sd = params
    yPred = (1-a1)*k1*np.exp(-k1*x1) + a1*k2*np.exp(-k2*x1)
    negLL = -np.sum(stats.norm.pdf(ydata2, loc=yPred, scale=sd))
    return negLL

guess = np.array([0.4, 1/30, 1/320, 0.2])
bnds = ((0, 0.9), (1/200, 1/2), (1/1000, 1/100), (0, 1))
## best function used for global fitting

results = optimize.minimize(MLE, guess, method='SLSQP', bounds=bnds)

print(results)
A1, K1, K2, _ = results.x
y_fitted = (1-A1)*K1*np.exp(-K1*x1) + A1*K2*np.exp(-K2*x1)

## plot actual data
plt.plot(bin_centres, counts, 'ko', label=" actual data")
plt.xlabel("Dwell Times (s)")
plt.ylabel("Probability")

## plot fitted data on original data
plt.plot(x1, y_fitted, c='r', linestyle='dashed', label="fit")
plt.legend()
plt.xscale('log')
plt.yscale('log')

plt.show()

The fit summary shows:

Out:
     fun: -1.7494005752178573e-16
     jac: array([-3.24161825e-18,  0.00000000e+00,  4.07105635e-16, -6.38053319e-14])
 message: 'Optimization terminated successfully.'
    nfev: 6
     nit: 1
    njev: 1
  status: 0
 success: True
       x: array([0.4       , 0.03333333, 0.003125  , 0.2       ])

This is the plot showing the fit. Though the fit seems to work, the result just returns the guesses I provided! Also, if I change the guesses the fit changes as well, meaning it is probably not converging at all. I am not sure what I am doing wrong. Just to say, I am not an expert in Python or math. Any help is highly appreciated. Thanks in advance.

SCBera
  • There are a few things I am not sure you are doing as you wish. Most importantly, what do you mean by a "double exponential" distribution? This? https://en.wikipedia.org/wiki/Laplace_distribution Here you are producing a weighted sum of two distributions with a 0.6 to 0.4 ratio. – Cindy Almighty Aug 30 '19 at 08:09
  • Hi Cindy, thanks for your comment. By "double-exponential" I mean that my actual data are a mixture of two exponential distributions. For example, in my code I simulate two exponentials with scales of 20 and 500 (units), and the contributions of the two should sum to 1 (0.4 + 0.6). Did I answer your query? – SCBera Aug 30 '19 at 08:34
  • Yup, thanks. Why are you calculating the negative log-likelihood like negLL = -np.sum(stats.norm.pdf(ydata2, loc=yPred, scale=sd))? – Cindy Almighty Aug 30 '19 at 08:56
  • I would also think of `exp(exp(x))` on hearing "double exponential", so thanks for the clarification. Fits like this are often really hard to get right. I'd propose taking the logarithm of your data and then performing two linear fits in the corresponding ranges of the x-values. – nostradamus Aug 30 '19 at 09:26
  • @nostradamus: I have edited it accordingly, thanks. Actually, people have done this in MATLAB and other programming languages (Igor Pro etc., in published papers), but I don't know the code or how it works. Thanks for the suggestion about linearly fitting the log data. I can try that, but I think I'd need to pre-define the x-range, right? – SCBera Aug 30 '19 at 11:57
  • @Cindy: Well, what I understood is that one needs to maximize the logLL, but we are using scipy's minimization function (I am not sure whether any maximizer exists). So one can take the negative of the logLL and minimize it. However, I doubt the same expression (stats.norm.pdf(ydata2, loc=yPred, scale=sd)) applies in my case the way it did for the others (in the links)! – SCBera Aug 30 '19 at 13:01

1 Answer


There are several places where I would say you made mistakes. For example, you are passing x1 (equidistant x-values) instead of ydata2 directly. Then, you are using an inappropriate negative log-likelihood: you are supposed to take the negative log of the model's probability density evaluated at your observed data under the assumed parameters. With that, your fourth parameter (sd) is unnecessary. Your function should be:

def MLE(params):
    """ find the max likelihood """
    a1, k1, k2 = params
    yPred = (1-a1)*k1*np.exp(-k1*ydata2) + a1*k2*np.exp(-k2*ydata2)
    negLL = -np.sum(np.log(yPred))
    return negLL
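
That is, for each observed dwell time t the model density is f(t) = (1-a1)*k1*exp(-k1*t) + a1*k2*exp(-k2*t), so the negative log-likelihood of the whole sample is -sum_i log f(t_i); minimizing this is equivalent to maximizing the likelihood prod_i f(t_i).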

The code still fails to converge for numerical reasons (it scales badly), and the suggestions about linearization might help. What you can easily do, though, is change your optimization method to e.g. L-BFGS-B, and the fit should then converge properly.
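
For instance, a minimal sketch (with MLE, guess and bnds as defined in the full code below):

results = optimize.minimize(MLE, guess, method='L-BFGS-B', bounds=bnds)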

(Plot: the fitted curve over the log-binned histogram of the simulated data, produced by the code below.)

Full code:

import numpy as np
import matplotlib.pyplot as plt
from scipy import optimize
import scipy.stats as stats

size = 300000
nbins = 30

def simu_dt():
    ## simulate Exp2 data
    np.random.seed(20)
    ## generate uniform random values between 0 and 1
    x = np.random.rand(size)
    data = []
    for n in x:
        if n < 0.6:
            # generating 1st exp data
            data.append(np.random.exponential(scale=20)) # t1
        else:
            # generating 2nd exp data
            data.append(np.random.exponential(scale=500)) # t2
    return np.array(data)

ydata2 = simu_dt() # call to generate simulated data
## trimming the data at the beginning and the end a bit
ydata2 = ydata2[np.where(2 < ydata2)]
ydata2 = ydata2[np.where(ydata2 < 3000)]

## creating the normalized log binned histogram data
bins = 10 ** np.linspace(np.log10(np.min(ydata2)), np.log10(np.max(ydata2)), nbins)
counts, bin_edges = np.histogram(ydata2, bins=bins)
bin_centres = (bin_edges[:-1] + bin_edges[1:]) / 2
bin_width = (bin_edges[1:] - bin_edges[:-1])
counts = counts / bin_width / np.sum(counts)

## generating arbitrary x value
x1 = np.linspace(bin_centres.min(), bin_centres.max(), len(ydata2))

def MLE(params):
    """ find the max likelihood """
    k1, k2 = params
    yPred = 0.6*k1*np.exp(-k1*ydata2) + 0.4*k2*np.exp(-k2*ydata2)
    negLL = -np.sum(np.log(yPred))
    return negLL

guess = np.array([1/30, 1/200])
bnds = ((1/100, 1/2), (1/1000, 1/100))
## with bounds given and no method specified, minimize defaults to L-BFGS-B

results = optimize.minimize(MLE, guess, bounds=bnds)

print(results)
K1, K2 = results.x
y_fitted = 0.6*K1*np.exp(-K1*x1) + 0.4*K2*np.exp(-K2*x1)

## plot actual data
plt.plot(bin_centres, counts, 'ko', label=" actual data")
plt.xlabel("Dwell Times (s)")
plt.ylabel("Probability")

## plot fitted data on original data
plt.plot(x1, y_fitted, c='r', linestyle='dashed', label="fit")
plt.legend()
plt.xscale('log')
plt.yscale('log')

plt.show()
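
A quick sanity check one could append: invert the fitted rates and compare them with the scales used in simu_dt (20 and 500):

print("fitted time constants:", 1/K1, 1/K2)  # should land near the simulated scales of 20 and 500
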
Cindy Almighty
  • Would you please explain why you used "np.sum(np.log(-np.log(yPred)))" instead of "-np.sum(np.log(yPred))"? I mean, why did you use log twice in the full code? Otherwise, your suggestion seems to work fine for me with -np.sum(np.log(yPred)). Thanks a lot for the help. – SCBera Aug 30 '19 at 15:55
  • Sorry, that was a testing version meant to handle some scaling issues, i.e. how the low-probability values dominate the cost function during optimization. We could discuss that with someone awesome on another thread, but the image above was created with -np.sum(np.log(yPred)). I edited my answer :) – Cindy Almighty Aug 30 '19 at 22:25