How can I vectorize the multivariate normal CDF (cumulative distribution function) in Python?
While looking at this post, I found out that there is a Fortran implementation of the multivariate normal CDF that was "ported" over to Python. This means I can easily evaluate the CDF for a single case.
However, I'm having a lot of trouble efficiently applying this function to multiple entries.
Specifically, the function I need to "vectorize" takes 4 arguments (see the single-call example after the list):
- the lower bounds of integration (vector)
- the upper bounds of integration (vector)
- the means of the normal random variables (vector)
- the covariance matrix of the normal random variables (matrix)
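For reference, a single evaluation looks like this (a minimal sketch; mvnun returns the probability together with an informational status code):

import numpy as np
from scipy.stats.mvn import mvnun

# P(0 <= X1 <= 1, 0 <= X2 <= 1) for a standard bivariate normal
low = np.array([0.0, 0.0])
upp = np.array([1.0, 1.0])
mean = np.array([0.0, 0.0])
cov = np.eye(2)
p, info = mvnun(low, upp, mean, cov)  # p = probability, info = status flag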
But I'm trying to efficiently evaluate this function over a list of 1000+ elements MANY times.
Here is some code to illustrate my problem; the example below just uses random data.
import time
import numpy as np
from scipy.stats.mvn import mvnun # library that calculates MVN CDF
np.random.seed(666)
iters = 1000 # number of times the whole dataset will be evaluated
obs = 1500 # number of elements in the dataset
dim = 2 # dimension of multivariate normal distribution
# Creates a random correlation matrix
def gen_random_corr_mtx(d, k):
    # Source: https://stats.stackexchange.com/a/125020/215330
    # d = number of dimensions
    # k = number of factors. more factors -> smaller correlations
    W = np.random.randn(d*k).reshape((d, k))
    S_temp = np.dot(W, W.T) + np.diag(np.random.rand(d))
    S_norm = np.diag(1/np.sqrt(np.diag(S_temp)))
    S = np.dot(np.dot(S_norm, S_temp), S_norm)
    return S
# Creates a covariance matrix from a correlation matrix and
# an array of std devs
def from_cor_to_cov(cor_mtx, st_devs):
    cor_mtx = np.array(cor_mtx)
    st_devs = np.array(st_devs)
    st_devs_diag = np.diag(st_devs)
    cov_mtx = st_devs_diag.dot(cor_mtx).dot(st_devs_diag)
    return cov_mtx
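To make the conversion concrete, the helper computes cov[i,j] = s_i * s_j * cor[i,j]. A quick worked example with made-up values:

# Made-up values: correlation 0.5 with std devs 2.0 and 3.0 gives an
# off-diagonal covariance of 0.5 * 2.0 * 3.0 = 3.0:
# from_cor_to_cov([[1.0, 0.5], [0.5, 1.0]], [2.0, 3.0])
# -> [[4.0, 3.0],
#     [3.0, 9.0]]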
# Creating an array with the lower bounds of the integration
lower = np.random.rand(obs,dim)
# Creating an array with the upper bounds of the integration
upper = lower + np.random.rand(obs,dim)
# Creating an array with the means of the distributions
means = np.random.rand(obs,dim)
# Generating the random covariance matrices
covs = []
for i in range(obs):
    # Making sure the covariance matrix is positive definite
    # (all eigenvalues strictly > 0)
    while True:
        cor_mtx = gen_random_corr_mtx(dim, 2)
        st_devs = np.abs(np.random.randn(dim))
        cov_mtx = from_cor_to_cov(cor_mtx, st_devs)
        if np.all(np.linalg.eigvals(cov_mtx) > 0):
            break
    covs.append(cov_mtx)
covs = np.array(covs)
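At this point all the inputs are stacked along the first axis:

# Sanity check on shapes:
# lower, upper, means -> (1500, 2)    i.e. (obs, dim)
# covs                -> (1500, 2, 2) i.e. (obs, dim, dim)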
# Here is where the trouble starts.
time_start = time.time()
for i in range(iters):
    results = []
    for j in range(obs):
        this_p, this_i = mvnun(lower[j], upper[j], means[j], covs[j])
        results.append(this_p)
time_end = time.time()
print(time_end-time_start)
# > 4.090040922164917
Here I have a dataset with 1500 observations, which I'm evaluating 1000 times. On my machine, this takes about 4.09 seconds to calculate.
Note that I'm not trying to get rid of the outer for-loop (over i); I only added it to mimic my real problem. The loop I'm really trying to eliminate is the inner one (over j).
The execution time could be vastly reduced if I found a way to efficiently evaluate the CDF over the whole dataset.
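In other words, what I'm after is something with this call shape (purely hypothetical; as far as I know no such batched variant exists, it just illustrates the goal):

# Hypothetical batched call -- mvnun_batched does NOT exist; it only
# illustrates the interface I'm hoping to find or build:
# probs = mvnun_batched(lower, upper, means, covs)  # probs.shape == (obs,)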
I know that the mvnun function was originally written in Fortran (original code here) and "ported" to Python using f2py, as can be seen here.
Can anyone help me out with this? I've started looking at Theano, but it seems the only option there is its scan function, which might not be much of an improvement either.
Thank you!!!
Update (2023-04-14)
The scipy library was recently updated to allow the user to specify the lower limits of integration in the multivariate_normal.cdf() method, instead of them always defaulting to an n-dimensional array of -inf values.
Here's the updated part of the code using the new method:
import scipy
time_start = time.time()
for i in range(iters):
    results = []
    for j in range(obs):
        this_p = scipy.stats.multivariate_normal.cdf(lower_limit=lower[j],
                                                     x=upper[j],
                                                     mean=means[j],
                                                     cov=covs[j])
        results.append(this_p)
time_end = time.time()
print(time_end-time_start)
# > 268.0528531074524
The method above is significantly slower than the one I originally proposed, taking over 4 minutes to compute. But since most of scipy's classes and methods are already set up for vectorized operations, I'm guessing/hoping the code I posted here only needs to be lightly tweaked to be truly vectorized.
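For instance, if every observation shared a single mean and covariance matrix, I believe x already accepts a whole (obs, dim) batch of points in one call; whether lower_limit broadcasts row-wise the same way is something I'd still need to verify. A sketch under those assumptions:

# Sketch, assuming x accepts an (obs, dim) batch and lower_limit
# broadcasts against it (one shared distribution for all rows):
probs = scipy.stats.multivariate_normal.cdf(lower_limit=lower,
                                            x=upper,
                                            mean=means[0],  # shared mean
                                            cov=covs[0])    # shared cov
# hoped-for result: probs.shape == (obs,)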