How can I vectorize the multivariate normal CDF (cumulative distribution function) in Python?
While looking at this post, I found out that there is a Fortran implementation of the multivariate normal CDF that was "ported" over to Python. This means I can easily evaluate the CDF for a single case.
However, I'm having a lot of trouble efficiently applying this function to multiple entries.
Specifically, the function I need to "vectorize" takes 4 arguments (see the single-call example after the list):
- the lower bounds of integration (vector)
- the upper bounds of integration (vector)
- the means of the normal random variables (vector)
- the covariance matrix of the normal random variables (matrix)
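For reference, a single evaluation looks like this (a minimal sketch; mvnun returns the probability together with an informational status code):

import numpy as np
from scipy.stats.mvn import mvnun

# P(0 <= X1 <= 1, 0 <= X2 <= 1) for a standard bivariate normal
low = np.array([0.0, 0.0])
upp = np.array([1.0, 1.0])
mean = np.array([0.0, 0.0])
cov = np.eye(2)
p, info = mvnun(low, upp, mean, cov)  # p = probability, info = status flag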
But I'm trying to efficiently evaluate this function over a list of 1000+ elements MANY times.
Here is some code to illustrate my problem; the example below just uses random data.
import time
import numpy as np
from scipy.stats.mvn import mvnun # library that calculates MVN CDF
np.random.seed(666)
iters = 1000 # number of times the whole dataset will be evaluated
obs = 1500 # number of elements in the dataset
dim = 2 # dimension of multivariate normal distribution
# Creates a random correlation matrix
def gen_random_corr_mtx(d, k):
    # Source: https://stats.stackexchange.com/a/125020/215330
    # d = number of dimensions
    # k = number of factors. more factors -> smaller correlations
    W = np.random.randn(d*k).reshape((d, k))
    S_temp = np.dot(W, W.T) + np.diag(np.random.rand(d))
    S_norm = np.diag(1/np.sqrt(np.diag(S_temp)))
    S = np.dot(np.dot(S_norm, S_temp), S_norm)
    return S
# Creates a covariance matrix from a correlation matrix and
# an array of std devs
def from_cor_to_cov(cor_mtx, st_devs):
    cor_mtx = np.array(cor_mtx)
    st_devs = np.array(st_devs)
    st_devs_diag = np.diag(st_devs)
    cov_mtx = st_devs_diag.dot(cor_mtx).dot(st_devs_diag)
    return cov_mtx
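To make the conversion concrete, the helper computes cov[i,j] = s_i * s_j * cor[i,j]. A quick worked example with made-up values:

# Made-up values: correlation 0.5 with std devs 2.0 and 3.0 gives an
# off-diagonal covariance of 0.5 * 2.0 * 3.0 = 3.0:
# from_cor_to_cov([[1.0, 0.5], [0.5, 1.0]], [2.0, 3.0])
# -> [[4.0, 3.0],
#     [3.0, 9.0]]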
# Creating an array with the lower bounds of the integration
lower = np.random.rand(obs,dim)
# Creating an array with the upper bounds of the integration
upper = lower + np.random.rand(obs,dim)
# Creating an array with the means of the distributions
means = np.random.rand(obs,dim)
# Generating the random covariance matrices
covs = []
for i in range(obs):
    # Making sure the covariance matrix is positive definite
    # (all eigenvalues strictly > 0)
    while True:
        cor_mtx = gen_random_corr_mtx(dim, 2)
        st_devs = np.abs(np.random.randn(dim))
        cov_mtx = from_cor_to_cov(cor_mtx, st_devs)
        if np.all(np.linalg.eigvals(cov_mtx) > 0):
            break
    covs.append(cov_mtx)
covs = np.array(covs)
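At this point all the inputs are stacked along the first axis:

# Sanity check on shapes:
# lower, upper, means -> (1500, 2)    i.e. (obs, dim)
# covs                -> (1500, 2, 2) i.e. (obs, dim, dim)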
# Here is where the trouble starts.
time_start = time.time()
for i in range(iters):
    results = []
    for j in range(obs):
        this_p, this_i = mvnun(lower[j], upper[j], means[j], covs[j])
        results.append(this_p)
time_end = time.time()
print(time_end-time_start)
# > 4.090040922164917
Here I have a dataset with 1500 observations, which I'm evaluating 1000 times. On my machine, this takes about 4.09 seconds to calculate.
Note that I'm not trying to get rid of the outer for-loop (over i); I only added it to mimic my real problem. The loop I'm really trying to eliminate is the inner one (over j).
The execution time could be vastly reduced if I found a way to efficiently evaluate the CDF over the whole dataset.
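In other words, what I'm after is something with this call shape (purely hypothetical; as far as I know no such batched variant exists, it just illustrates the goal):

# Hypothetical batched call -- mvnun_batched does NOT exist; it only
# illustrates the interface I'm hoping to find or build:
# probs = mvnun_batched(lower, upper, means, covs)  # probs.shape == (obs,)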
I know that the mvnun function was originally written in Fortran (original code here) and "ported" to Python using f2py, as can be seen here.
Can anyone help me out with this? I've started looking at Theano, but it seems the only option there is its scan function, which might not be much of an improvement either.
Thank you!!!
Update (2023-04-14)
The scipy library was recently updated to allow the user to specify the lower limits of integration in the multivariate_normal.cdf() method, instead of them always defaulting to an n-dimensional array of -inf values.
Here's the updated part of the code using the new method:
import scipy
time_start = time.time()
for i in range(iters):
    results = []
    for j in range(obs):
        this_p = scipy.stats.multivariate_normal.cdf(lower_limit=lower[j],
                                                     x=upper[j],
                                                     mean=means[j],
                                                     cov=covs[j])
        results.append(this_p)
time_end = time.time()
print(time_end-time_start)
# > 268.0528531074524
The method above is significantly slower than the one I originally proposed, taking over 4 minutes to compute. But since most of scipy's classes and methods are already set up for vectorized operations, I'm guessing/hoping the code I posted here only needs to be lightly tweaked to be truly vectorized.
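For instance, if every observation shared a single mean and covariance matrix, I believe x already accepts a whole (obs, dim) batch of points in one call; whether lower_limit broadcasts row-wise the same way is something I'd still need to verify. A sketch under those assumptions:

# Sketch, assuming x accepts an (obs, dim) batch and lower_limit
# broadcasts against it (one shared distribution for all rows):
probs = scipy.stats.multivariate_normal.cdf(lower_limit=lower,
                                            x=upper,
                                            mean=means[0],  # shared mean
                                            cov=covs[0])    # shared cov
# hoped-for result: probs.shape == (obs,)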