
In scipy there is no support for fitting discrete distributions to data. I know there are a lot of questions about this.

For example, if I have an array like the one below:

x = [2,3,4,5,6,7,0,1,1,0,1,8,10,9,1,1,1,0,0]

I couldn't apply the following to this array:

from scipy.stats import nbinom
param = nbinom.fit(x)

But I would like to ask: as of today, is there any way to fit these three discrete distributions (negative binomial, Poisson, geometric) and then choose the best fit for the discrete dataset?

  • What do you mean, there is no support? What about https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.curve_fit.html? – mkrieger1 Dec 12 '19 at 16:00
  • I edited my question @mkrieger1 – Salih Dec 12 '19 at 16:04
  • What is `x` supposed to mean? What did you expect `nbinom.fit(x)` to do? `scipy.stats.nbinom` has no `fit` method. – mkrieger1 Dec 12 '19 at 16:28
  • I know that there is no fit method. I want to learn whether there is any way to fit these discrete distributions and get their parameters or not... @mkrieger1 – Salih Dec 12 '19 at 16:30
  • FYI: For Poisson, see https://stackoverflow.com/questions/37500406/how-to-fit-a-poisson-distribution-with-seaborn/37500643#37500643 – Warren Weckesser Dec 12 '19 at 16:49
  • Thank you, but I am trying to find a fit for all discrete distributions and pick the best one @Weckesser. – Salih Dec 12 '19 at 17:08
  • There is no generic method to fit an arbitrary discrete distribution, as there are infinitely many of them, with potentially unlimited parameters. There are methods to fit a particular distribution, though, e.g. the Method of Moments. If you only need these three I can show how to use it. – Marat Dec 12 '19 at 17:27
  • @Marat, I'd like that. It would be super helpful for me, thank you. – Salih Dec 12 '19 at 17:56
  • @mkrieger1 I know it's not your responsibility or anything, but the decision to omit 'fit' methods from all discrete distributions seems like a pretty weak design choice. Surely it's reasonable to provide methods for the commonly-used discrete distributions. Oh well, I won't worry about it. – Robert Dodier Dec 12 '19 at 20:26

2 Answers


You can use the Method of Moments to fit any particular distribution.

Basic idea: get empirical first, second, etc. moments, then derive distribution parameters from these moments.

So, in all these cases we only need two moments. Let's get them:

import pandas as pd
# for other distributions, you'll need to implement the PMF
from scipy.stats import nbinom, poisson, geom

x = [2,3,4,5,6,7,0,1,1,0,1,8,10,9,1,1,1,0,0]  # the data from the question
x = pd.Series(x)
mean = x.mean()
var = x.var()
likelihoods = {}  # we'll use it later

Note: I used pandas instead of numpy. That is because numpy's var() and std() don't apply Bessel's correction, while pandas' do. If you have 100+ samples, there shouldn't be much difference, but on smaller samples it could be important.
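For example, to get the same corrected estimate from numpy, you would pass ddof=1 explicitly (a small illustration of the point above):

import numpy as np

# numpy defaults to the population variance (ddof=0); pass ddof=1
# to apply Bessel's correction and match pandas' x.var()
var_unbiased = np.var(x, ddof=1)
var_biased = np.var(x)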

Now, let's get parameters for these distributions. The negative binomial has two parameters: p, r. Let's estimate them and calculate the likelihood of the dataset:

# From the wikipedia page, we have:
# mean = pr / (1-p)
# var = pr / (1-p)**2
# without wiki, you could use MGF to get moments; too long to explain here
# Solving for p and r, we get:

p = 1 - mean / var  # TODO: check for zero variance and limit p by [0, 1]
r = (1-p) * mean / p

UPD: Wikipedia and scipy use different definitions of p, one treating it as the probability of success and the other as the probability of failure. So, to be consistent with the scipy notation, use:

p = mean / var
r = p * mean / (1-p)

END OF UPD
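As a quick sanity check (assuming the sample variance exceeds the sample mean, so that 0 < p < 1), you can verify that these scipy-style parameters reproduce the empirical moments:

# sanity check: with 0 < p < 1, the scipy-parameterized negative binomial
# reproduces the empirical moments used to derive p and r
assert 0 < p < 1, "method of moments for nbinom requires var > mean"
print(nbinom.mean(r, p), mean)  # r * (1 - p) / p equals the sample mean
print(nbinom.var(r, p), var)    # r * (1 - p) / p**2 equals the sample variance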

UPD2:

I'd suggest using the log-likelihood code from @thilak's answer instead. It avoids loss of precision, which is especially important on large samples.

END OF UPD2

Calculate likelihood:

likelihoods['nbinom'] = x.map(lambda val: nbinom.pmf(val, r, p)).prod()

Same for Poisson, there is only one parameter:

# from Wikipedia,
# mean = variance = lambda. Nothing to solve here
lambda_ = mean
likelihoods['poisson'] = x.map(lambda val: poisson.pmf(val, lambda_)).prod()

Same for Geometric distribution:

# mean = 1 / p  # this form fits the scipy definition
p = 1 / mean

# note: scipy's geom has support k = 1, 2, ..., so zero-valued observations get pmf 0
likelihoods['geometric'] = x.map(lambda val: geom.pmf(val, p)).prod()

Finally, let's get the best fit:

best_fit = max(likelihoods, key=lambda name: likelihoods[name])
print("Best fit:", best_fit)
print("Likelihood:", likelihoods[best_fit])

Let me know if you have any questions.

Marat
  • Thank you so much. I have one more question; I'd appreciate it if you could answer it. I know that my dataset is discrete, but let's say I wanted to see whether it fits the normal distribution or not. Is that possible? Is there any way to do this, like the Method of Moments? – Salih Dec 12 '19 at 20:59
  • And also, can we apply this method to mixture models, like a binomial mixture? – Salih Dec 12 '19 at 21:14
  • @Salih yes, it works for continuous distributions like the Gaussian. In some cases, however, it is hard or even impossible to estimate all parameters. For example, for a mixture of two binomials you'll need three parameters and thus three moments; it is already unpleasant to solve. It gets even worse once you add more components into the mixture. – Marat Dec 12 '19 at 22:03
  • @Marat, how do I do it if I want to do it for a mixture of two binomials? Could you please show me the way? – Salih Dec 13 '19 at 07:16
  • @Salih Sorry, but I will not. It already deviates from the scope of the original question, so maybe it's better to post a separate question for that. Also, it is actually five parameters, not three, and as I mentioned it gets really unpleasant to solve in closed form. In practice, mixture models are usually estimated using the EM algorithm and Gaussians. – Marat Dec 13 '19 at 15:03
  • @Marat, are you sure about the negative binomial code above? When I generate random data, it does not give the correct result for that distribution. – Salih Dec 13 '19 at 16:00
  • @Salih, the posted code actually doesn't work for the question's data, as p > 1, right? What is the previous step to take before implementing your algorithm? – Alfonso Santiago Jan 18 '23 at 15:17

Great answer by Marat.

In addition to Marat's post, I would most certainly recommend taking the log of the probability mass function. Some information on why log-likelihood is preferred over likelihood: https://math.stackexchange.com/questions/892832/why-we-consider-log-likelihood-instead-of-likelihood-in-gaussian-distribution

I would rewrite the code for the negative binomial to:

log_likelihoods = {}
# r and p are the negative binomial parameters from Marat's answer
# (p = mean / var, r = p * mean / (1 - p)), not the geometric p computed later there
log_likelihoods['nbinom'] = x.map(lambda val: nbinom.logpmf(val, r, p)).sum()

Note that I have used:

  • logpmf instead of pmf
  • sum instead of product

And to find the best distribution:

best_fit = max(log_likelihoods, key=lambda name: log_likelihoods[name])
print("Best fit:", best_fit)
print("log_Likelihood:", log_likelihoods[best_fit])
thilak g