
I have several data sets, each of which fits a von Mises distribution very well on its own. I am looking for a way of fitting all of them with a shared mu but different kappas, without having to worry about the choice of bins.

When one wants to fit only one model it is quite trivial: scipy here does the fit with the raw data. But I have been looking into global fitting using symfit or lmfit, and in some posts (here and here), and in all cases we have to specify x-coordinates and y-coordinates, which means we first have to choose some bin size for the distribution.

Here is some artificial data for just two data sets that could be useful as an example of what I need, though fitting each one individually with scipy. (Note that I don't need to care about the choice of bins.)

import numpy as np
import scipy.stats as st
import matplotlib.pyplot as plt

# creating the data
mu1, mu2 = .05, -.05
sigma1, sigma2 = 3.1, 2.9
n1, n2 = 8000, 9000
y1 = np.random.vonmises(mu1, sigma1, n1)
y2 = np.random.vonmises(mu2, sigma2, n2)

# fitting
dist = st.vonmises
*args1, loc1, scale1 = dist.fit(y1, fscale=1)
*args2, loc2, scale2 = dist.fit(y2, fscale=1)

x1 = np.linspace(np.min(y1), np.max(y1), 200)
x2 = np.linspace(np.min(y2), np.max(y2), 200)

pdf_fitted1 = dist.pdf(x1, *args1, loc=loc1, scale=scale1)
pdf_fitted2 = dist.pdf(x2, *args2, loc=loc2, scale=scale2)

# plotting
plt.hist(y1, bins=40, density=True, histtype='step', color='#1f77b4')
plt.hist(y2, bins=40, density=True, histtype='step', color='#ff7f0e')
plt.plot(x1, pdf_fitted1, color='#1f77b4')
plt.plot(x2, pdf_fitted2, color='#ff7f0e')
plt.show()

I'd be glad if someone could help with that, thanks in advance. Any answer or comment would be appreciated.

alpelito7

1 Answer


Thank you for this excellent question. In principle you should be able to solve this using symfit out of the box, but your question brought to light some minor issues. symfit is very much a project in flux, so unfortunately this will happen from time to time. But I created a workaround that should work on the current master branch, and I hope to release a new version soon which will address these issues.

In principle this is a combination of the global fitting examples you already found with the LogLikelihood objective function. With LogLikelihood, you don't need to bin but instead use the measurements directly. However, LogLikelihood does not seem to handle multi-component models properly yet, so I included a fixed version of it.

import numpy as np
import matplotlib.pyplot as plt
from symfit import (
    Fit, parameters, variables, exp, cos, CallableModel, pi, besseli
)
from symfit.core.objectives import LogLikelihood
from symfit.core.minimizers import *
from symfit.core.printing import SymfitNumPyPrinter

# symbolic bessel is not converted into numerical yet, this monkey-patches it.
def _print_besseli(self, expr):
    return 'scipy.special.iv({}, {})'.format(*expr.args)

SymfitNumPyPrinter._print_besseli = _print_besseli

# creating the data
mu1, mu2 = .05, -.05  # Are these supposed to be opposite sign?
sigma1, sigma2 = 3.5, 2.5
n1, n2 = 8000, 9000
np.random.seed(42)
x1 = np.random.vonmises(mu1, sigma1, n1)
x2 = np.random.vonmises(mu2, sigma2, n2)

# Create a model for `n` different datasets.
n = 2
x, *xs = variables('x,' + ','.join('x_{}'.format(i) for i in range(1, n + 1)))
ys = variables(','.join('y_{}'.format(i) for i in range(1, n + 1)))
mu, kappa = parameters('mu, kappa')
kappas = parameters(','.join('k_{}'.format(i) for i in range(1, n + 1)),
                    min=0, max=10)
mu.min, mu.max = - np.pi, np.pi  # Bound to 2 pi

# Create a model template, whose symbols we will replace for each component.
template = exp(kappa * cos(x - mu)) / (2 * pi * besseli(0, kappa))
model = CallableModel(
    {y_i: template.subs({kappa: k_i, x: x_i}) for y_i, x_i, k_i in zip(ys, xs, kappas)}
)
print(model)

class AlfredosLogLikelihood(LogLikelihood):
    def __call__(self, *args, **kwargs):
        evaluated_func = super(LogLikelihood, self).__call__(
            *args, **kwargs
        )

        ans = - sum([np.nansum(np.log(component))
                     for component in evaluated_func])
        return ans

fit = Fit(model, x_1=x1, x_2=x2, objective=AlfredosLogLikelihood)

x_axis = np.linspace(- np.pi, np.pi, 101)
fit_result = fit.execute()
print(fit_result)
x1_result, x2_result = model(x_1=x_axis, x_2=x_axis, **fit_result.params)

# plotting
plt.hist(x1, bins=40, density=True, histtype='step', color='#1f77b4')
plt.hist(x2, bins=40, density=True, histtype='step', color='#ff7f0e')
plt.plot(x_axis, x1_result, color='#1f77b4')
plt.plot(x_axis, x2_result, color='#ff7f0e')
plt.show()

This outputs the following:

[y_1(x_1; k_1, mu) = exp(k_1*cos(mu - x_1))/(2*pi*besseli(0, k_1)),
 y_2(x_2; k_2, mu) = exp(k_2*cos(mu - x_2))/(2*pi*besseli(0, k_2))]

Parameter Value        Standard Deviation
k_1       3.431673e+00 None
k_2       2.475649e+00 None
mu        1.030791e-02 None
Status message         b'CONVERGENCE: REL_REDUCTION_OF_F_<=_FACTR*EPSMCH'
Number of iterations   13
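As a cross-check (not part of the symfit workflow above): the same shared-mu, per-kappa maximum-likelihood fit can be sketched directly with scipy by minimizing the joint negative log-likelihood over the raw samples, which also avoids binning. The starting values and bounds below are illustrative assumptions:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import vonmises

# Artificial data of the same shape as in the question.
rng = np.random.default_rng(42)
y1 = rng.vonmises(0.05, 3.5, 8000)
y2 = rng.vonmises(-0.05, 2.5, 9000)

def neg_log_likelihood(params):
    """Joint negative log-likelihood: shared mu, one kappa per data set."""
    mu, k1, k2 = params
    return -(vonmises.logpdf(y1, k1, loc=mu).sum()
             + vonmises.logpdf(y2, k2, loc=mu).sum())

# L-BFGS-B is used automatically because bounds are given.
res = minimize(neg_log_likelihood, x0=[0.0, 1.0, 1.0],
               bounds=[(-np.pi, np.pi), (1e-3, 10), (1e-3, 10)])
mu_hat, k1_hat, k2_hat = res.x
print(res.x)  # mu near 0, kappas near the true 3.5 and 2.5
```

This gives estimates in close agreement with the symfit result, since both minimize the same objective.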

[Plot: histograms of both data sets with the fitted shared-mu von Mises curves overlaid]

I hope this gets you in the right direction, and thank you for bringing this bug to light ;).

tBuLi
  • Great answer @tBuLi, thanks. Although I am having this error: No module named 'symfit.core.printing'. I am using `symfit` version 0.4.6 – alpelito7 May 07 '19 at 08:49
  • Yeah I'm afraid you will have to use the current master branch for this to work, I don't think I can find a solution in 0.4.6. – tBuLi May 07 '19 at 08:53
  • Can you please tell me where I can find it? On [pypi](https://pypi.org/search/?q=symfit) I just found version 0.4.6. – alpelito7 May 07 '19 at 08:55
  • 1
    Oh, yes @tBuLi, the `mu`'s are supposed to be opposite sign, but just in order to test that the algorithm was fitting well (as in this case, where `mu` is almost zero). – alpelito7 May 07 '19 at 09:18
  • 1
    You can find it on github: https://github.com/tBuLi/symfit. You'll need to clone the master branch and then install that in pip – tBuLi May 07 '19 at 11:24
  • Glad to hear it worked @Alfredo :). As for the regression coefficient, it does not exist for likelihood fitting, hence the `nan` and the warning. In Pull Request (PR) 221 this is fixed, which will be merged soon to create symfit 0.5.0. The same goes for the standard deviations: the solution I gave you above does not implement the derivatives of the loglikelihood function so no standard deviation can be calculated. However, by now I fully fixed your issue in this [PR237](https://github.com/tBuLi/symfit/pull/237), so if you run that instead you should get them :). – tBuLi May 08 '19 at 13:10
  • Oh of course, of course (I don't know what I was thinking about). Thank you very much for these last suggestions and also for your time :) (`symfit` is great btw) – alpelito7 May 08 '19 at 14:19