
I am trying to fit a gamma distribution to my data points, and I can do that using the code below.

import scipy.stats as ss
import numpy as np

dataPoints = np.arange(0, 1000, 0.2)
# Maximum-likelihood fit of a gamma distribution, with the location fixed at 0
fit_alpha, fit_loc, fit_beta = ss.gamma.fit(dataPoints, floc=0)

I want to reconstruct a larger distribution using many such small gamma distributions (the larger distribution is irrelevant to the question; it only justifies why I am trying to fit a cdf as opposed to a pdf).

To achieve that, I want to fit a cumulative distribution function, as opposed to a pdf, to my smaller-distribution data. More precisely, I want to fit the data to only a part of the cumulative distribution.

For example, I want to fit the data only up to the point where the cumulative distribution function (with a certain scale and shape) reaches 0.6.

Any thoughts on using fit() for this purpose?

asked by Sahil M (question edited by das-g)
  • Could you just construct an empirical cdf from your data and fit it to the gamma cdf using e.g. `curve_fit`, http://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.curve_fit.html ? – ev-br Sep 17 '13 at 12:43
  • I think your `ss.gamma.fit(dataPoints, floc=0)` doesn't produce any meaningful results, because your dataPoints are not a sample from a gamma distribution. `fit` can only estimate the distribution parameters from sample points. Follow Zhenya's advice if you just want to fit the shape of the cdf, or fit sample points to a truncated version of the gamma distribution. – Josef Sep 17 '13 at 16:55
  • @Zhenya Yes, I thought of that as a last resort, if a function to do this is not available, for multiple reasons: one being that I wanted to use MLE for this as opposed to least squares, and secondly the cdf for the gamma distribution is slightly non-trivial (but of course possible). Thirdly, do you know how I can constrain the fit as mentioned in the question? – Sahil M Sep 17 '13 at 18:27
  • @user333700 I have used this only as an example; I am not really trying to fit something that is deterministically not gamma to gamma. And what do you mean by "fit to the shape of the cdf", and "truncated version of the gamma distribution"? – Sahil M Sep 17 '13 at 18:31
  • If you have sample points that are only from part of the distribution, you could define a new truncated distribution pdf_trunc(x) = pdf(x) / cdf(truncation_point) with truncation point = ppf(0.6, known parameters) and estimate those. If you have several pieces, you can stitch them together as a mixture distribution. (Assuming I understand your question correctly.) – Josef Sep 17 '13 at 19:39
  • @user333700 Could you elaborate a bit on why pdf_trunc(x) = pdf(x) / cdf(truncation_point) represents the truncated distribution mathematically? Also, I am sorry, I am not familiar with ppf, are you referring to piecewise polynomial interpolation? – Sahil M Sep 17 '13 at 20:01
  • http://en.wikipedia.org/wiki/Truncated_distribution and `ppf` is what the inverse cdf (quantile function) is called in scipy.stats.distributions. – Josef Sep 17 '13 at 20:05
  • @user333700 Yeah, truncated distributions is a good recommendation, thank you. I need some time to ponder more over it to understand it in detail. I will get back here tomorrow, as it is 2 in the night here! – Sahil M Sep 17 '13 at 20:56
  • Quite interesting, I suspect fitting a `pdf` and a `cdf` are not equivalent under most common error functions (euclidean, manhattan, etc.). Does anyone have a good link that addresses this problem? – Dima Tisnek Dec 12 '14 at 10:35
  • As @qarma noted, fitting data points to a `cdf` is not without problems, since it adds additional semantics compared to a conventional estimator. One reason is that a fit to a `cdf` estimator is not invariant to coordinate transformations (e.g., `F(x) => F(-x)`, or rotations in the multivariate case), since the direction of integration (e.g., `x` or `-x`) matters. @Benjamin, can you give more insight into what the motivation is for looking at `cdf`s instead of `pdf`s? – Dietrich Mar 02 '15 at 22:05
  • Btw., I agree with @ev-br that generic `curve_fit` is probably a better way to go about this. – Dima Tisnek Mar 03 '15 at 10:11
  • Could you provide us some more information on the problem? It sounds like you're fitting several of these 'truncated' distributions and then combining them in some way. I think the answer to your question depends on how you plan to use the resulting combined distribution. – Dan Frank Mar 28 '15 at 19:14
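Following up on the `curve_fit` suggestion in the comments, here is a minimal sketch of fitting a gamma cdf to just the lower part of an empirical cdf. The sample data, the `p0` starting values, and the 0.6 cut-off are all illustrative assumptions, and note that this is a least-squares fit on the cdf rather than the MLE fit asked about:

```python
import numpy as np
from scipy import optimize, stats

# Hypothetical sample: draws from a gamma distribution (for illustration only)
rng = np.random.default_rng(42)
sample = np.sort(rng.gamma(shape=2.0, scale=3.0, size=500))

# Empirical cdf evaluated at the sorted sample points
ecdf = np.arange(1, len(sample) + 1) / len(sample)

# Restrict the fit to the part of the curve below a cumulative probability of 0.6
mask = ecdf <= 0.6

def gamma_cdf(x, a, scale):
    # Gamma cdf with the location fixed at 0, as in the question
    return stats.gamma.cdf(x, a, loc=0, scale=scale)

(a_hat, scale_hat), _ = optimize.curve_fit(
    gamma_cdf, sample[mask], ecdf[mask], p0=[1.0, 1.0])
```

For a maximum-likelihood fit of a partial sample, the truncated-distribution approach sketched in the comments (dividing the pdf by `cdf(truncation_point)`) would be the more principled route.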

1 Answer


I understand that you are trying to piecewise-reconstruct your cdf with several small gamma distributions, each with a different scale and shape parameter, capturing the 'local' regions of your distribution.

That probably makes sense if your empirical distribution is multi-modal / difficult to summarize with one 'global' parametric distribution.

I don't know if you have specific reasons for fitting several gamma distributions in particular, but in case your goal is to fit a distribution that is relatively smooth and captures your empirical cdf well, perhaps you can take a look at Kernel Density Estimation. It is essentially a non-parametric way to fit a distribution to your data.

http://scikit-learn.org/stable/modules/density.html http://en.wikipedia.org/wiki/Kernel_density_estimation

For example, you can try out a gaussian kernel and change the bandwidth parameter to control how smooth your fit is. A bandwidth that is too small leads to an unsmooth ("overfitted") result [high variance, low bias]. A bandwidth that is too large results in a very smooth fit but with high bias [low variance, high bias].

from sklearn.neighbors import KernelDensity
# KernelDensity expects a 2-D array of shape (n_samples, n_features)
kde = KernelDensity(kernel='gaussian', bandwidth=0.2).fit(dataPoints[:, None])
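Once fitted, `score_samples` gives log-density values, which you can exponentiate to get the estimated density. A small self-contained sketch with made-up data:

```python
import numpy as np
from sklearn.neighbors import KernelDensity

# Made-up sample data, reshaped to the 2-D (n_samples, n_features) layout
rng = np.random.default_rng(0)
dataPoints = rng.normal(loc=5.0, scale=1.0, size=100)[:, None]

kde = KernelDensity(kernel='gaussian', bandwidth=0.2).fit(dataPoints)

# score_samples returns the *log* of the estimated density at each point
grid = np.linspace(0.0, 10.0, 50)[:, None]
dens = np.exp(kde.score_samples(grid))
```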

A good way to select a bandwidth parameter that balances the bias-variance tradeoff is to use cross-validation. Essentially, the high-level idea is to partition your data, run the analysis on the training set, and 'validate' on the test set; this prevents overfitting the data.

Fortunately, sklearn also implements a nice example of choosing the best bandwidth of a Gaussian kernel using cross-validation, which you can borrow some code from:

http://scikit-learn.org/stable/auto_examples/neighbors/plot_digits_kde_sampling.html
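As a rough sketch of that idea (the data here is synthetic, the bandwidth grid is an arbitrary choice, and `GridSearchCV` lives in `sklearn.model_selection` in current scikit-learn versions):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KernelDensity

# Synthetic sample data; KernelDensity wants shape (n_samples, n_features)
rng = np.random.default_rng(0)
data = rng.normal(size=200)[:, None]

# Cross-validate the bandwidth over a small grid, scoring each candidate
# by the held-out log-likelihood (KernelDensity's default score)
grid = GridSearchCV(KernelDensity(kernel='gaussian'),
                    {'bandwidth': np.linspace(0.1, 1.0, 10)},
                    cv=5)
grid.fit(data)
best_bw = grid.best_params_['bandwidth']
```

The selected `best_bw` can then be plugged back into `KernelDensity(bandwidth=best_bw)` and fitted on the full dataset.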

Hope this helps!

– Azmy Rajab