
I have an input dataframe df_input with 10 variables and 100 rows. These data are not normally distributed. I would like to generate an output dataframe with 10 variables and 10,000 rows, such that the covariance matrix and mean of the new dataframe are the same as those of the original one. The output variables should not be normally distributed, but rather have a distribution similar to that of the input variables. That is: Cov(df_output) = Cov(df_input) and mean(df_output) = mean(df_input). Is there a Python function that does this?

Note: np.random.multivariate_normal(mean_input, Cov_input, 10000) does almost this, but the output variables are normally distributed, whereas I need them to have the same (or a similar) distribution as the input.

Rick
  • There [is a question](https://stackoverflow.com/q/58490211/827519) from 1 year and 3 months ago which is very similar to this one. @pjs was asking for clarification there as to which non-normal distribution is to be used. – wsdookadr Jan 24 '21 at 23:52
  • You may want to look at the concept of a [copula](https://en.wikipedia.org/wiki/Copula_(probability_theory)). It is a fundamental tool for such problems, especially in multivariate settings. However, if you are interested only in the first two moments (mean, variance), then just build a factor model and generate more data from it. – Pierre D Jan 25 '21 at 00:39
  • @pjs: the answer to "which non-normal distribution is to be used" is: "the new distribution should follow the same distribution as the original one, whatever it is. That is, it can have fat tails, for instance, or anything else". The goal is basically to enlarge the dataset while following the same distribution and correlations as the original dataset. – Rick Jan 25 '21 at 17:38

4 Answers


Update

I just noticed your mention of np.random.multivariate_normal... It does in one fell swoop the equivalent of gen_like() below!

I'll leave it here to help people understand the mechanics of this, but to summarize:

  1. you can match the mean and covariance of an empirical distribution with a (rotated, scaled, translated) normal;
  2. for a better match of higher moments, you should look at the copula.

Original answer

Since you are interested in matching only the first two moments (mean, covariance), you can use a simple PCA to obtain a suitable model of the initial data. Note that the newly generated data will be a normal ellipsoid, rotated, scaled, and translated to match the empirical mean and covariance of the initial data.

If you want a more sophisticated "replication" of the original distribution, then you should look at copulas, as I said in the comments.

So, for the first two moments only, assuming your input data is d0:

import numpy as np
from sklearn.decomposition import PCA

def gen_like(d0, n):
    """Generate n samples matching the mean and covariance of d0."""
    pca = PCA(n_components=d0.shape[1]).fit(d0)
    z0 = pca.transform(d0)  # z0 is centered and uncorrelated (cov is diagonal)
    # draw normal samples with the same per-component standard deviation
    z1 = np.random.normal(size=(n, d0.shape[1])) * np.std(z0, 0)

    # project back to the input space (undo rotation, restore the mean)
    d1 = pca.inverse_transform(z1)
    return d1

Example:

# generate some random data

# arbitrary transformation matrix
F = np.array([
    [1, 2, 3],
    [2, 1, 4],
    [5, 1, 3],
])
d0 = np.random.normal(2, 4, size=(10000, 3)) @ F.T

np.mean(d0, 0)
# ex: array([12.12791066, 14.10333273, 17.95212292])

np.cov(d0.T)
# ex: array([[225.09691912, 257.39878551, 259.40288019],
#            [257.39878551, 338.34087242, 373.4773562 ],
#            [259.40288019, 373.4773562 , 566.29288861]])
# try to match mean, variance of d0
d1 = gen_like(d0, 10000)

np.allclose(np.mean(d0, 0), np.mean(d1, 0), rtol=0.1)
# often True (but not guaranteed)

np.allclose(np.cov(d0.T), np.cov(d1.T), rtol=0.1)
# often True (but not guaranteed)

What's funny is that you can fit a square peg in a round hole (i.e., demonstrating that only the mean and variance are matched, not the higher moments):

d0 = np.random.uniform(5, 10, size=(1000, 3)) @ F.T
d1 = gen_like(d0, 10000)

np.allclose(np.mean(d0, 0), np.mean(d1, 0), rtol=0.1)
# often True (but not guaranteed)

np.allclose(np.cov(d0.T), np.cov(d1.T), rtol=0.1)
# often True (but not guaranteed)
Pierre D
  • If I understand correctly, your gen_like always generates a normal distribution, even if d0 is non-normal. But what I need is to generate data that has the same distribution as d0 (whatever it is), while preserving cov(d0.T). Am I missing something? – Rick Jan 25 '21 at 17:26
  • The original spec was "`Cov(df_output) == Cov(df_input)` and `mean(df_output) = mean(df_input)`". My point was that a multivariate normal can always satisfy that. In order to satisfy the additional and less precise "but have a distribution similar to the original data", you need something like copulas, which are essentially multidimensional cumulative distribution functions. See e.g. [this repo](https://github.com/sdv-dev/Copulas) (I haven't tested it, just looked at some of the Google results for "copula python"). – Pierre D Jan 25 '21 at 17:37
  • Yes, indeed it seems that copulas is the way. I am checking it. Thanks! – Rick Jan 25 '21 at 17:59
  • Please see my answer to my posted question below. Thanks a lot for your idea, it works very well. – Rick Jan 27 '21 at 22:32

Have you tried looking at the NumPy docs?: https://numpy.org/doc/stable/reference/random/generated/numpy.random.multivariate_normal.html
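
For reference, a minimal sketch of what that function does (the exponential df_input below is a made-up stand-in for the asker's actual data):

import numpy as np
import pandas as pd

# stand-in for the asker's df_input: 100 rows x 10 non-normal variables
rng = np.random.default_rng(0)
df_input = pd.DataFrame(rng.exponential(size=(100, 10)))

mean_input = df_input.mean().values
cov_input = df_input.cov().values

# 10,000 rows matching the mean and covariance -- but the marginals come out
# normal, which is exactly the limitation the asker points out below
df_output = pd.DataFrame(np.random.multivariate_normal(mean_input, cov_input, 10000))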

John
  • This generates a NORMAL distribution, whereas I need an output that has the same or a similar distribution as the input. – Rick Jan 24 '21 at 23:49
  • @Rick are there methods/techniques in statistics to detect the distribution of a dataset? OK, it looks like there are [such methods](https://stats.stackexchange.com/a/219143/90824). – wsdookadr Jan 24 '21 at 23:57
  • @wsdookadr: To check against normal, I simply use Q-Q plots and look at skewness and kurtosis; there are probably other methods. I don't know how to detect the distribution in general, as in your question... – Rick Jan 25 '21 at 17:34

Have you considered using a GAN (generative adversarial network)? It takes a bit more effort than just calling a predefined function, but it does essentially what you are hoping to do. Here's the original paper: https://arxiv.org/abs/1406.2661

There are many PyTorch/Tensorflow codes that you can download and fit to your purposes, for example this one: https://github.com/eriklindernoren/PyTorch-GAN

Here is also a blog post with an introduction to GANs that I found quite helpful: https://medium.com/ai-society/gans-from-scratch-1-a-deep-introduction-with-code-in-pytorch-and-tensorflow-cb03cdcdba0f
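
To make the idea concrete, here is a minimal, untuned PyTorch sketch; all layer sizes, hyperparameters, and the random `real` tensor are illustrative stand-ins, not a recipe from the thread:

import torch
import torch.nn as nn

n_features = 10   # number of variables in df_input
latent_dim = 16

# generator: noise -> synthetic row; discriminator: row -> P(row is real)
G = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, n_features))
D = nn.Sequential(nn.Linear(n_features, 64), nn.LeakyReLU(0.2), nn.Linear(64, 1), nn.Sigmoid())

opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCELoss()

real = torch.randn(100, n_features)  # stand-in for the (scaled) input data

for step in range(2000):
    ones = torch.ones(real.size(0), 1)
    zeros = torch.zeros(real.size(0), 1)

    # discriminator step: label real rows 1, generated rows 0
    fake = G(torch.randn(real.size(0), latent_dim)).detach()
    d_loss = bce(D(real), ones) + bce(D(fake), zeros)
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # generator step: try to make the discriminator label fakes as real
    g_loss = bce(D(G(torch.randn(real.size(0), latent_dim))), ones)
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()

# sample 10,000 synthetic rows
with torch.no_grad():
    df_output = G(torch.randn(10000, latent_dim)).numpy()

With only 100 training rows a GAN is likely to overfit or mode-collapse, which supports the caveat below.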

Maybe a GAN is overkill for this problem and there are simpler methods for increasing the sample size, in which case I'd be interested to learn about them.

Mafu

The best method is indeed to use copulas, as suggested by many. A simple description is in the link below, which also provides a simple Python code example. The method preserves covariances while augmenting the data, and it generalizes to non-symmetric or non-normal distributions. Thanks to all for helping.

https://sdv.dev/Copulas/tutorials/03_Multivariate_Distributions.html
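
A minimal sketch of that approach, assuming the [copulas](https://github.com/sdv-dev/Copulas) package is installed (the exponential df_input is again a made-up stand-in for the real data):

import numpy as np
import pandas as pd
from copulas.multivariate import GaussianMultivariate

# stand-in for df_input: 100 rows x 10 non-normal variables
rng = np.random.default_rng(0)
df_input = pd.DataFrame(rng.exponential(size=(100, 10)))

# fit a Gaussian copula: normal dependence structure, fitted marginals
model = GaussianMultivariate()
model.fit(df_input)

# 10,000 rows with marginals and correlations similar to df_input
df_output = model.sample(10000)

The mean and covariance of df_output should come out close to (though not exactly equal to) those of df_input, while the marginals keep their non-normal shape.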

Rick