
I'm trying to implement a basic Monte Carlo simulator in Python for some project management risk modeling (basically Crystal Ball / @Risk, but in Python).

I have a set of n random variables (all scipy.stats instances). I know that I can use rv.rvs(size=k) to generate k independent observations from each of these n variables.

I'd like to introduce correlations among the variables by specifying an n x n positive semi-definite correlation matrix.

Is there a clean way to do this in scipy?

What I've Tried

This answer and this answer seem to indicate that "copulas" would be an answer, but I don't see any reference to them in scipy.

This link seems to implement what I'm looking for, but I'm not sure if scipy has this functionality implemented already. I'd also like it to work for non-normal variables.

It seems that the Iman-Conover paper describes the standard method.

MikeRand
  • Is this what you are looking for? http://stackoverflow.com/a/16025584/190597 – unutbu Jan 01 '15 at 02:01
  • Works for normal variables ... I have other distributions. – MikeRand Jan 01 '15 at 02:02
  • It appears that the recommended method (Iman-Conover) uses a multi-variate normal to do what I'm looking for, so I think your comment will probably be a large piece of the final solution (which is probably something I'll have to build by hand). – MikeRand Jan 01 '15 at 02:32
  • Any chance that you could share the python code you developed for generating random variables with correlations? – Ark May 30 '17 at 08:09
  • Your question is not complete, because the marginals are unspecified. According to Sklar's theorem, a distribution function is fully specified by its marginal distributions and its copula. Various copulas will produce a correlation: although the Gaussian copula is a specific choice, there are many others. – Michael Baudin Nov 10 '20 at 09:03
  • Isn't the set of n variables the set of marginals? – MikeRand Nov 10 '20 at 21:10

4 Answers


If you just want correlation through a Gaussian Copula (*), then it can be calculated in a few steps with numpy and scipy.

  • create multivariate normal random variables with the desired covariance using numpy.random.multivariate_normal, giving a (nobs by k_variables) array

  • apply scipy.stats.norm.cdf to each column/variable to transform the normal draws into uniform random variables, giving uniform marginal distributions

  • apply dist.ppf to transform each uniform margin into the desired distribution, where dist can be any of the distributions in scipy.stats (a short sketch follows below)

(*) The Gaussian copula is only one choice, and it is not the best when we are interested in tail behavior, but it is the easiest to work with. See, for example, http://archive.wired.com/techbiz/it/magazine/17-03/wp_quant?currentPage=all
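A minimal sketch of the three steps above, assuming two variables, a placeholder 0.7 correlation, and arbitrary gamma/lognormal marginals:

import numpy as np
from scipy import stats

# Step 1: correlated standard normals, shape (nobs, k_variables)
corr = np.array([[1.0, 0.7],
                 [0.7, 1.0]])  # placeholder correlation matrix
nobs = 10_000
z = np.random.multivariate_normal(np.zeros(2), corr, size=nobs)

# Step 2: normal -> uniform marginals via the standard normal cdf
u = stats.norm.cdf(z)

# Step 3: uniform -> desired marginals via each distribution's ppf
x0 = stats.gamma(a=2.0).ppf(u[:, 0])     # placeholder gamma marginal
x1 = stats.lognorm(s=0.5).ppf(u[:, 1])   # placeholder lognormal marginal

print(np.corrcoef(x0, x1)[0, 1])  # close to, though not exactly, 0.7

Note that the correlation is imposed on the underlying normals, so the Pearson correlation of the transformed variables will generally differ slightly from the target.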

Two references:

https://stats.stackexchange.com/questions/37424/how-to-simulate-from-a-gaussian-copula

http://www.mathworks.com/products/demos/statistics/copulademo.html

(I might have done this a while ago in python, but don't have any scripts or functions right now.)

Josef
  • Do you know of any similar, more memory-efficient solutions? I'm doing this with 'cov_matrix = toeplitz(rho**arange(p))', but I'm running into memory errors when I hit high dimensions. – MHankin May 17 '16 at 15:27
  • How can I get uniform marginal distributions in python? – Ark May 30 '17 at 14:22
  • @Ark To get uniform marginal distributions you skip the last step. – Josef May 30 '17 at 17:11
  • What do you mean with "creating a (nobs by k_variables) array"? @Josef – JackLametta May 28 '19 at 10:40
  • `nobs` is number of observations. You create nobs random variables each of dimension k_variables. For data analysis we usually have observations in rows and data series or variables in columns. – Josef May 28 '19 at 13:16

It seems like a rejection-based sampling method such as the Metropolis-Hastings algorithm is what you want. Scipy can implement such methods with its scipy.optimize.basinhopping function.

Rejection-based sampling methods allow you to draw samples from any given probability distribution. The idea is that you draw random samples from another "proposal" pdf that is easy to sample from (such as a uniform or Gaussian distribution) and then use a random test to decide whether this sample from the proposal distribution should be "accepted" as representing a sample of the desired distribution.

The remaining tricks will then be:

  1. Figure out the form of the joint N-dimensional probability density function which has marginals of the form you want along each dimension, but with the correlation matrix that you want. This is easy to do for the Gaussian distribution, where the desired correlation matrix and mean vector is all you need to define the distribution. If your marginals have a simple expression, you can probably find this pdf with some straightforward-but-tedious algebra. This paper cites several others which do what you are talking about, and I'm certain that there are many more.

  2. Formulate a function for basinhopping to minimize such that its accepted "minima" amount to samples of the pdf you have defined.

Given the results of (1), (2) should be straightforward.
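For illustration, here is a minimal random-walk Metropolis sketch. The bivariate-normal target and the step size are placeholders; in practice you would plug in the joint log-pdf from step (1) and discard a burn-in period:

import numpy as np

def metropolis_hastings(log_pdf, x0, n_samples, step=0.5, rng=None):
    # Random-walk Metropolis sampler for an (unnormalized) log density
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x0, dtype=float)
    samples = np.empty((n_samples, x.size))
    logp = log_pdf(x)
    for i in range(n_samples):
        proposal = x + step * rng.standard_normal(x.size)  # Gaussian proposal
        logp_prop = log_pdf(proposal)
        # Accept with probability min(1, p(proposal) / p(current))
        if np.log(rng.random()) < logp_prop - logp:
            x, logp = proposal, logp_prop
        samples[i] = x
    return samples

# Placeholder target: a correlated bivariate normal, up to a constant
cov_inv = np.linalg.inv(np.array([[1.0, 0.8], [0.8, 1.0]]))
draws = metropolis_hastings(lambda x: -0.5 * x @ cov_inv @ x, [0.0, 0.0], 10_000)
print(np.corrcoef(draws.T))  # sample correlation approaches 0.8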

stochastic

If you already have a positive semi-definite correlation matrix R [n x n], it's easy to build a NormalCopula that takes R as input. I'll show you an example with n = 3. The code is based on the OpenTURNS library.

import openturns as ot

# you can replace this part by your matrix
dim = 3
R = ot.CorrelationMatrix(dim)
R[0,1] = 0.25
R[0,2] = 0.6
R[1,2] = 0.9

copula = ot.NormalCopula(R)

Should you like to draw a sample of a given size, just write

size = 5
print(copula.getSample(size))
>>>    [ X0       X1       X2       ]
0 : [ 0.355353 0.76205  0.632379 ]
1 : [ 0.902567 0.984443 0.989552 ]
2 : [ 0.423219 0.811016 0.754304 ]
3 : [ 0.303776 0.471557 0.450188 ]
4 : [ 0.746168 0.918729 0.891347 ]

EDIT - Following the comment of @Michael_Baudin

Of course, if you want to set the marginal distributions to e.g. Beta and LogNormal, it's also possible:

X0 = ot.LogNormal(0.1, 1, 0)
X1 = ot.Beta()
X2 = ot.Uniform(1.0, 2.0)
distribution = ot.ComposedDistribution([X0, X1, X2], copula)  # 'copula' is the NormalCopula built above
print(distribution.getSample(size))
>>> [ X0         X1         X2         ]
0 : [  3.97678    0.158823   1.75635   ]
1 : [  1.18929   -0.554092   1.18952   ]
2 : [  2.59542    0.0751359  1.68599   ]
3 : [  1.33363   -0.18407    1.42241   ]
4 : [  1.34084    0.198019   1.6553    ]
Jean A.
  • I suggest extending the script to set the marginal distributions with e.g. Beta and LogNormal marginals, because the question mentioned "I'd also like it to work for non-normal variables." – Michael Baudin Nov 10 '20 at 09:06
import typing

import numpy as np
import scipy.stats


def run_gaussian_copula_simulation_and_get_samples(
        ppfs: typing.List[typing.Callable[[np.ndarray], np.ndarray]],  # List of $num_dims percentile point functions
        cov_matrix: np.ndarray,  # covariance matrix, shape($num_dims, $num_dims)
        num_samples: int,  # number of random samples to draw
) -> np.ndarray:
    num_dims = len(ppfs)

    # Draw random samples from multidimensional normal distribution -> shape($num_samples, $num_dims)
    ran = np.random.multivariate_normal(np.zeros(num_dims), cov_matrix, (num_samples,), check_valid="raise")

    # Transform back into a uniform distribution, i.e. the space [0,1]^$num_dims
    U = scipy.stats.norm.cdf(ran)

    # Apply ppf to transform samples into the desired distribution
    # Each row of the returned array will represent one random sample -> access with a[i]
    return np.array([ppfs[i](U[:, i]) for i in range(num_dims)]).T  # shape($num_samples, $num_dims)


# Example 1. Uncorrelated data, i.e. both distributions are independent
f1 = run_gaussian_copula_simulation_and_get_samples(
    [lambda x: scipy.stats.norm.ppf(x, loc=100, scale=15), scipy.stats.norm.ppf],
    [[1, 0], [0, 1]],
    6
)
# Example 2. Completely correlated data, i.e. both percentiles match
f2 = run_gaussian_copula_simulation_and_get_samples(
    [lambda x: scipy.stats.norm.ppf(x, loc=100, scale=15), scipy.stats.norm.ppf],
    [[1, 1], [1, 1]],
    6
)
np.set_printoptions(suppress=True)  # suppress scientific notation
print(f1)
print(f2)

A few notes on this function. np.random.multivariate_normal does a lot of the heavy lifting for us; note in particular that we do not need to decompose the correlation matrix ourselves. ppfs is passed as a list of functions, each of which takes one input and returns one value.

In my particular use case I needed to generate multivariate-t-distributed random variables (in addition to normal-distributed ones), consult this answer on how to do that: https://stackoverflow.com/a/41967819/2111778. Additionally, I used scipy.stats.t.cdf for the back-transform part.
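For reference, a rough sketch of that variant, following the construction in the linked answer (the function name and signature here are mine, not from a library):

import numpy as np
import scipy.stats

def multivariate_t_uniforms(cov_matrix, df, num_samples):
    # Multivariate t = multivariate normal / sqrt(chi2 / df), drawn per sample
    num_dims = len(cov_matrix)
    z = np.random.multivariate_normal(np.zeros(num_dims), cov_matrix, num_samples)
    chi2 = np.random.chisquare(df, (num_samples, 1))
    t_samples = z / np.sqrt(chi2 / df)
    # Back-transform each margin with the t cdf instead of the normal cdf
    return scipy.stats.t.cdf(t_samples, df)  # shape (num_samples, num_dims)

The returned uniforms can then be fed into the same ppf step as in the function above.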

In my particular use case the desired distributions were empirical distributions representing expected financial loss. The final data points then had to be added together to get a total financial loss across all of the individual-but-correlated financial events. Thus, np.array(...).T is actually replaced by sum(...) in my code base.
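In other words, with the function as written above, summing the correlated draws per scenario amounts to something like this (hypothetical call, reusing ppfs and cov_matrix from the examples):

# Hypothetical usage: one aggregate loss figure per simulated scenario
samples = run_gaussian_copula_simulation_and_get_samples(ppfs, cov_matrix, 100_000)
total_loss = samples.sum(axis=1)  # shape (100_000,)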

xjcl