
Hi there, I have read some of the threads here, and I am still confused.

I thought the scipy stats (continuous random variable) function, stats.rv_name.pdf(x, loc, scale, *params), should give values that sum to 1.

I basically fitted scatter-plot data using the code below. I do get a cumulative value of 1.0 (eventually), but my pdf_fitted does not sum to one.

I still don't understand why that is, or how to call the pdf function so that its output sums to one.

There is a relevant thread here: Why does scipy.norm.pdf sometimes give PDF > 1? How to correct it?

import inspect

import numpy as np
import pandas as pd
from scipy import stats


def py_DistEstimate(arr1, disType, reSults='params', bins=20):
    dist = getattr(stats, disType)

    # Fit the chosen distribution; param is (shape params..., loc, scale).
    param = dist.fit(arr1)
    x = np.linspace(min(arr1), max(arr1), bins)
    pdf_fitted = dist.pdf(x, *param[:-2], loc=param[-2], scale=param[-1])
    cdf_fitted = dist.cdf(x, *param[:-2], loc=param[-2], scale=param[-1])

    if reSults == 'pdf':
        # Empirical per-bin probabilities for comparison with the fitted pdf.
        digitizeV = np.digitize(arr1, x, right=True)
        # minlength keeps the counts aligned with x even when trailing bins are empty.
        bin_counV = np.bincount(digitizeV, minlength=len(x))[:len(x)]
        bin_probV = bin_counV / len(arr1)
        return pd.DataFrame({'x-axis': x, 'pdf': pdf_fitted,
                             'original': bin_probV, 'cdf': cdf_fitted})
    elif reSults == 'params':
        parameter_names = ([p for p in inspect.signature(dist._pdf).parameters
                            if p != 'x'] + ['loc', 'scale'])
        return pd.DataFrame({'names': parameter_names, 'values': param})
    elif reSults == 'listparams':
        # All continuous distributions available in scipy.stats.
        dist_continu = [d for d in dir(stats)
                        if isinstance(getattr(stats, d), stats.rv_continuous)]
        return dist_continu
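
For reference, here is a minimal sketch of what the sampled pdf values actually add up to; it uses stats.norm and an arbitrary grid over [-5, 5] as stand-ins for the fitted distribution and data range above:

```python
import numpy as np
from scipy import stats

x = np.linspace(-5, 5, 20)   # 20 sample points, like bins=20 above
pdf = stats.norm.pdf(x)      # point values of the density, not probabilities

dx = x[1] - x[0]             # width of each sample interval

print(pdf.sum())             # not 1 in general
print(np.sum(pdf * dx))      # Riemann sum: close to 1

# A density value can also exceed 1 on its own, e.g. for a narrow distribution:
print(stats.norm.pdf(0.0, scale=0.1))   # ≈ 3.99
```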
Kiann
  • The [*integral*](https://en.wikipedia.org/wiki/Integral) of the PDF over its domain is 1. The sum of a sample of the PDF is not, in general, 1. You'll get a value that is *approximately* 1 if you multiply each sample by the width of the sample interval and sum those values. – Warren Weckesser Jul 17 '18 at 22:47
  • hi Warren, yes... the integral of a pdf function, from x = -infinity to +infinity = 1.0. That I could see and double-checked by using cdf_fitted = dist.cdf()... The cdf function generates the cumulative probability, which is just the integral. ... however, my understanding is that the pdf under scipy is the normalized instantaneous probability... so I thought the sum of all the given pdf points should be 1.0; because the sum of all points is effectively taking the integral, but over discrete buckets. – Kiann Jul 17 '18 at 22:55
  • *"... the sum of all the given pdf points should be 1.0"* Think about that some more. If you included *more* points, and then summed them, the sum would be even bigger. So why should the particular set of points that you initially chose sum to 1? In fact, some of the PDF values will be *larger than 1*. – Warren Weckesser Jul 17 '18 at 23:02
  • *"... because the sum of all points is effectively taking the integral, but over discrete buckets."* Not quite. You have to weight each point by the length of the corresponding "bucket" to get something that will approximate the integral. In other words, you are computing a [Riemann sum](https://en.wikipedia.org/wiki/Riemann_sum). – Warren Weckesser Jul 17 '18 at 23:04
  • (In a previous comment, I said "some of the PDF values will be larger than 1". I should have said "some of the PDF values can be larger than 1, depending on the parameters of the distribution".) – Warren Weckesser Jul 17 '18 at 23:11
  • thanks Warren. I think the discussion is veering slightly towards a more mathematical nature (which is always interesting) :-)... but along this line, so that we're on the same page... my understanding is that 1. consider the probability that a continuous r.v. (random variable) x = given value X, under a (pdf) distribution function, f(x) = exp(-x^2/2)/sqrt(2*pi) – Kiann Jul 19 '18 at 16:14
  • then 2. the probability X1 – Kiann Jul 19 '18 at 16:15
  • then hence 3. the probability of -infinity < x < +infinity = integral[-inf, +inf] f(x) dx – Kiann Jul 19 '18 at 16:16
  • and (I believe), by definition, this integral should be 1.0... and I think that's why a lot of people and I get confused about why the sum of the pdf values given by this python function doesn't come to 1.0... – Kiann Jul 19 '18 at 16:17
  • the usual code is to use x-axis = linspace(lower-bound, upper-bound, bins) and then .pdf(x, x-axis)... hence, I thought summing up the given pdfs was taking the integral... now, if we were to take the sumproduct of bands * pdf... it sounds like taking integral[-infinity, +infinity] f(x) * x dx... which, though, is the definition of the expectation of the variable, i.e. the mean... – Kiann Jul 19 '18 at 16:19
  • operationally, in the main, I think quite a few of us are just looking for a pdf function that provably sums up to 1.0... I actually did a work-around of generating the .cdf first, then using 'pdf' = numpy.ediff1d(.cdf, to_end=0.0)... – Kiann Jul 19 '18 at 16:21

0 Answers