
Given an array of values, I want to be able to fit a density function to it and find the pdf of an arbitrary input value. Is this possible, and how would I go about it? There aren't necessarily assumptions of normality, and I don't need the function itself.

For instance, given:

x = array([ 0.62529759, -0.08202699,  0.59220673, -0.09074541,  0.05517865,
        0.20153703,  0.22773723, -0.26229708,  0.76137555, -0.61229314,
        0.27292745,  0.35596795, -0.01373896,  0.32464979, -0.22932331,
        1.14796175,  0.17268531,  0.40692172,  0.13846154,  0.22752953,
        0.13087359,  0.14111479, -0.09932381,  0.12800392,  0.02605917,
        0.18776078,  0.45872642, -0.3943505 , -0.0771418 , -0.38822433,
       -0.09171721,  0.23083624, -0.21603973,  0.05425592,  0.47910286,
        0.26359565, -0.19917942,  0.40182097, -0.0797546 ,  0.47239264,
       -0.36654449,  0.4513859 , -0.00282486, -0.13950512, -0.05375369,
        0.03331833,  0.48951555, -0.13760504,  2.788     , -0.15017848,
        0.02930675,  0.10910646,  0.03868301, -0.048482  ,  0.7277376 ,
        0.08841259, -0.10968462,  0.50371324,  0.86379698,  0.01674877,
        0.19542421, -0.06639165,  0.74500856, -0.10148342,  0.02482331,
        0.79195804,  0.40401969,  0.25120005,  0.21020794, -0.01767013,
       -0.13453783, -0.09605592, -0.88044229,  0.04689623,  0.09043851,
        0.21232286,  0.34129982, -0.3736799 ,  0.17313858])

I would like to find how a value of 0.3 compares to all of the above, and what percent of the above values it is greater than.

Brian G
  • https://stackoverflow.com/questions/41974615/how-do-i-calculate-pdf-probability-density-function-in-python – Joe Aug 02 '17 at 15:28
  • @Joe I don't have the explicit representation of the density function, so I can't pass in anything for scipy integrate. Is there a workaround with just the data? – Brian G Aug 02 '17 at 15:30
  • calculate the cdf from the data? https://stackoverflow.com/a/41980574/7919597 then pdf? – Joe Aug 02 '17 at 15:34
  • https://stats.stackexchange.com/questions/23/finding-the-pdf-given-the-cdf – Joe Aug 02 '17 at 15:36
  • @Joe thanks for the tip, looks useful. However, my data is not normal so using scipy.norm would provide very inaccurate estimates. If possible, I could equivalently use any way to fit an explicit function to my data such that it can return an equation for me to integrate. Would you know of anything of that nature? – Brian G Aug 02 '17 at 15:38
  • https://en.wikipedia.org/wiki/Empirical_distribution_function and https://stats.stackexchange.com/questions/15891/what-is-the-proper-way-to-estimate-the-cdf-for-a-distribution-from-samples-taken – Joe Aug 02 '17 at 15:50
  • @Joe Thanks! That's great help :) – Brian G Aug 02 '17 at 15:53
  • `statsmodels` has an implementation of the empirical CDF: http://www.statsmodels.org/dev/generated/statsmodels.distributions.empirical_distribution.ECDF.html – Warren Weckesser Aug 02 '17 at 16:09
  • if you google for "empirical cdf python" you will find the functions – Joe Aug 02 '17 at 16:45
  • Maybe this http://quanteconpy.readthedocs.io/en/latest/_modules/quantecon/ecdf.html, or this https://stackoverflow.com/a/33346366/7919597 or this https://stackoverflow.com/a/36355843/7919597 – Joe Aug 02 '17 at 16:55
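The empirical-CDF approach suggested in the comments can be sketched directly with NumPy; the `ecdf` helper below is illustrative (not from any of the linked libraries), and the short `sample` array is placeholder data:

```python
import numpy as np

def ecdf(data):
    """Return a function evaluating the empirical CDF of `data`."""
    sorted_data = np.sort(np.asarray(data))
    def cdf(v):
        # fraction of observations less than or equal to v
        return np.searchsorted(sorted_data, v, side="right") / len(sorted_data)
    return cdf

sample = np.array([0.1, -0.2, 0.5, 0.3, 0.25])  # placeholder data
cdf = ecdf(sample)
print(cdf(0.3))  # fraction of sample values <= 0.3  -> 0.8
```

The ECDF answers the "what percent of the values is 0.3 greater than" part of the question directly, without fitting any density at all.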

3 Answers


I personally like using the scipy.stats package. It has a useful implementation of Kernel Density Estimation (KDE). A KDE estimates the probability density function of the data by placing a kernel (a Gaussian, in the case of scipy's `gaussian_kde`) on each data point and summing them; the kernel bandwidth is a parameter you can set. See the documentation and related examples here: https://docs.scipy.org/doc/scipy/reference/tutorial/stats.html#kernel-density-estimation

And for more about KDE: https://en.wikipedia.org/wiki/Kernel_density_estimation

Once you have built your KDE, you can integrate it to get probabilities. For example, to estimate the probability of observing a value as large as or larger than 0.3:

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

kde = stats.gaussian_kde(x)

# visualize the KDE
fig = plt.figure()
ax = fig.add_subplot(111)
x_eval = np.linspace(-1, 3, num=200)  # cover the full range of the data
ax.plot(x_eval, kde(x_eval), 'k-')

# probability of a value >= 0.3
kde.integrate_box_1d(0.3, np.inf)

TLDR: Calculate a KDE, then use the KDE as if it were a PDF.
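As a sanity check, the KDE tail probability can be compared with the raw empirical fraction; the sketch below uses synthetic normal data as a stand-in for the real array, and `scipy.stats.percentileofscore` for the empirical count:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(size=500)  # synthetic stand-in for the real data

kde = stats.gaussian_kde(sample)
p_kde = kde.integrate_box_1d(0.3, np.inf)               # P(X >= 0.3) under the KDE
p_emp = 1 - stats.percentileofscore(sample, 0.3) / 100  # empirical tail fraction

print(p_kde, p_emp)  # the two estimates should roughly agree
```

If the two numbers disagree badly, the KDE bandwidth is probably a poor fit for the data.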

Mati K

You can use openTURNS for that: its Gaussian kernel smoothing makes this easy. From the doc:

import openturns as ot

kernel = ot.KernelSmoothing()
estimated = kernel.build(ot.Sample(x, 1))  # build expects an ot.Sample, not a raw NumPy array

That's it, now you have a distribution object :)

This library is very useful for statistics! (I am not affiliated with them.)

tupui
  • Thanks! I'm not really familiar with openturns, and from what I can see there isn't an immediate method to calculate cdf of a value after building with the kernel. Are you aware of how to use the distribution object you mentioned (estimated) to calculate cdf of future samples? – Brian G Aug 02 '17 at 17:17
  • @BrianG yes you can. Here is the doc of a distribution object: http://openturns.github.io/openturns/latest/user_manual/_generated/openturns.Distribution.html#openturns.Distribution you have a `computeCDF` method :) Consider accepting the answer to mark it resolved. – tupui Aug 02 '17 at 17:20
  • @BrianG you're welcome. I know OT pretty well so do not hesitate to post questions about it – tupui Aug 02 '17 at 17:50
  • Hey, I actually encountered this problem when trying to use it: – Brian G Aug 02 '17 at 17:57
  • We would better [chat] – tupui Aug 02 '17 at 17:57
  • I actually don't think I have the reputation to chat :( any other way I can ask you? – Brian G Aug 02 '17 at 18:00
  • @BrianG ask another question and I'll reply then – tupui Aug 02 '17 at 18:01
  • https://stackoverflow.com/questions/45468050/openturns-error-when-trying-to-use-kernel-build - me on a different account because I'm at my question limit lol – Brian G Aug 02 '17 at 18:07

We first have to create the Sample from the NumPy array. Then we compute the complementary CDF with the `computeComplementaryCDF` method of the distribution (a small improvement over Yoda's answer).

import numpy as np
x = np.array([ 0.62529759, -0.08202699,  0.59220673, -0.09074541,  0.05517865,
        0.20153703,  0.22773723, -0.26229708,  0.76137555, -0.61229314,
        0.27292745,  0.35596795, -0.01373896,  0.32464979, -0.22932331,
        1.14796175,  0.17268531,  0.40692172,  0.13846154,  0.22752953,
        0.13087359,  0.14111479, -0.09932381,  0.12800392,  0.02605917,
        0.18776078,  0.45872642, -0.3943505 , -0.0771418 , -0.38822433,
       -0.09171721,  0.23083624, -0.21603973,  0.05425592,  0.47910286,
        0.26359565, -0.19917942,  0.40182097, -0.0797546 ,  0.47239264,
       -0.36654449,  0.4513859 , -0.00282486, -0.13950512, -0.05375369,
        0.03331833,  0.48951555, -0.13760504,  2.788     , -0.15017848,
        0.02930675,  0.10910646,  0.03868301, -0.048482  ,  0.7277376 ,
        0.08841259, -0.10968462,  0.50371324,  0.86379698,  0.01674877,
        0.19542421, -0.06639165,  0.74500856, -0.10148342,  0.02482331,
        0.79195804,  0.40401969,  0.25120005,  0.21020794, -0.01767013,
       -0.13453783, -0.09605592, -0.88044229,  0.04689623,  0.09043851,
        0.21232286,  0.34129982, -0.3736799 ,  0.17313858])
import openturns as ot

kernel = ot.KernelSmoothing()
sample = ot.Sample(x, 1)  # reshape the 1-D array into an n x 1 Sample
distribution = kernel.build(sample)
q = distribution.computeComplementaryCDF(0.3)  # P(X > 0.3)
print(q)

which prints:

0.29136124840835353
Michael Baudin