3

I am looking for a function to compute the CDF for a multivariate normal distribution. I have found that scipy.stats.multivariate_normal have only a method to compute the PDF (for a sample x) but not the CDF multivariate_normal.pdf(x, mean=mean, cov=cov)

I am looking for the same thing but to compute the cdf, something like: multivariate_normal.cdf(x, mean=mean, cov=cov), but unfortunately multivariate_normal doesn't have a cdf method.

The only thing that I found is this: Multivariate Normal CDF in Python using scipy but the presented method scipy.stats.mvn.mvnun(lower, upper, means, covar) doesn't take a sample x as a parameter, so I don't really see how to use it to have something similar to what I said above.

Community
  • 1
  • 1
eLearner
  • 325
  • 6
  • 18
  • 1
    Start checking [this](http://statsmodels.sourceforge.net/stable/generated/statsmodels.sandbox.distributions.extras.mvnormcdf.html#statsmodels.sandbox.distributions.extras.mvnormcdf). It's a high-quality library (if you are not familiar with it) – sascha Nov 06 '16 at 23:29
  • @sascha the same question I asked for `scipy.stats.mvn.mvnun` also apply to one that you provided in this link. – eLearner Nov 07 '16 at 10:24
  • So what exactly do you want? You want to *fit* a distribution to points? – sascha Nov 07 '16 at 10:40
  • @sascha No. I already explained that: I have a mean (vector) and co-variance matrix which define a multivariate normal distribution. Given a new data point x (vector) I want to compute its cumulative probability (CDF) not the probability density (PDF). – eLearner Nov 10 '16 at 12:20
  • No idea about this question ? – eLearner Nov 18 '16 at 10:43
  • Well. Check the Matlab-function doing this. There is a referenced paper. Seems you have to implement this yourself. – sascha Nov 18 '16 at 10:44
  • Actually, the method that I referred to `scipy.stats.mvn.mvnun(..)` assumethat the multivariate normal distribution is centered at the origin and that you’ve normalized all the variances, and that's why it doesn't take a data point `x` as parameter (I read this here: http://www.nhsilbert.net/source/2014/04/multivariate-normal-cdf-values-in-python/). So, I am sure that there is some way to use this method and a new data point `x`, in order to get a probability that we want. I just don't know how. – eLearner Nov 18 '16 at 13:19

2 Answers2

2

This is just a clarification of the points that @sascha made above in the comments for the answer. The relevant function can be found here:

As an example, in a multivariate normal distribution with diagonal covariance the cfd should give (1/4) * Total area = 0.25 (look at the scatterplot below if you don't understand why) The following example will allow you to play with it:

from statsmodels.sandbox.distributions.extras import mvnormcdf
from scipy.stats import mvn

for i in range(1, 20, 2):
    cov_example = np.array(((i, 0), (0, i)))
    mean_example = np.array((0, 0))
    print(mvnormcdf(upper=upper, mu=mean_example, cov=cov_example))

The output of this is 0.25, 0.25, 0.25, 0.25...


enter image description here

Heberto Mayorquin
  • 10,605
  • 8
  • 42
  • 46
1

The CDF of some distribution is actually an integral over the PDF of that distribution. That being so, you need to provide the function with the boundaries of the integral.

What most people mean when they ask for a p_value of some point in relation to some distribution is:

what is the chance of getting these values or higher given this distribution?

Note the area marked in red - it is not a point, but rather an integral from some point onwards:

enter image description here

Accordingly, you need to set your point as the lower boundary, +inf (or some arbitrarily high enough value) as the upper boundary and provide the means and covariance matrix you already have:

from sys import maxsize

def mvn_p_value(x, mu, cov_matrix):
    upper_bounds = np.array([maxsize] * x.size)  # make an upper bound the size of your vector
    p_value = scipy.stats.mvn.mvnun(x, upper_bounds, mu, cov_matrix)[1]
    if 0.5 < p_value:  # this inversion is used for two-sided statistical testing
        p_value = 1 - p_value
    return p_value
redlus
  • 2,301
  • 2
  • 12
  • 16