Update
I just noticed your mention of np.random.multivariate_normal... It does in one fell swoop the equivalent of gen_like() below!
I'll leave it here to help people understand the mechanics of this, but to summarize:
- you can match the mean and covariance of an empirical distribution with a (rotated, scaled, translated) normal;
- for a better match of higher moments, you should look at the copula.
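To make the shortcut concrete, here is a minimal sketch of the multivariate_normal route. The helper name gen_like_mvn and the seeded generator are my own choices for illustration, not part of the question or of NumPy:

```python
import numpy as np

def gen_like_mvn(d0, n, rng):
    """Hypothetical helper: draw n normal samples that share the
    empirical mean and covariance of d0 (rows = observations)."""
    mean = d0.mean(axis=0)
    cov = np.cov(d0, rowvar=False)
    return rng.multivariate_normal(mean, cov, size=n)

rng = np.random.default_rng(0)  # seeded only to make the sketch repeatable
# arbitrary transformation matrix, same as in the example further down
F = np.array([
    [1, 2, 3],
    [2, 1, 4],
    [5, 1, 3],
])
d0 = rng.normal(2, 4, size=(20_000, 3)) @ F.T
d1 = gen_like_mvn(d0, 20_000, rng)

# compare the first two moments of the original and the generated data
means_match = np.allclose(d0.mean(axis=0), d1.mean(axis=0), rtol=0.1)
covs_match = np.allclose(np.cov(d0, rowvar=False), np.cov(d1, rowvar=False), rtol=0.1)
```

As with gen_like(), the match is statistical: it tightens as the sample sizes grow, but it is never exact.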
Original answer
Since you are interested in matching only the first two moments (mean and covariance), you can use a simple PCA to obtain a suitable model of the initial data. Note that the newly generated data will be a normal ellipsoid, rotated, scaled, and translated to match the empirical mean and covariance of the initial data.
If you want a more sophisticated "replication" of the original distribution, you should look at copulas, as I said in the comments.
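To give an idea of what the copula route could look like, here is a rough sketch of one option: a Gaussian copula combined with empirical marginals. Everything in it (the name gen_like_copula, the rank-based normal scores, the use of statistics.NormalDist in place of scipy.stats.norm) is an assumption made for illustration. It reproduces each column's marginal distribution but still assumes a Gaussian dependence structure between columns, so treat it as a starting point only:

```python
from statistics import NormalDist
import numpy as np

_norm = NormalDist()
_ppf = np.vectorize(_norm.inv_cdf)  # scipy.stats.norm.ppf would do the same job
_cdf = np.vectorize(_norm.cdf)      # likewise scipy.stats.norm.cdf

def gen_like_copula(d0, n, rng):
    """Hypothetical helper: Gaussian copula with empirical marginals."""
    m, k = d0.shape
    # 1. ranks per column (assumes continuous data, i.e. no ties)
    ranks = d0.argsort(axis=0).argsort(axis=0) + 1
    # 2. ranks -> uniforms in (0, 1) -> normal scores
    z0 = _ppf(ranks / (m + 1))
    # 3. the correlation of the normal scores describes the dependence
    corr = np.corrcoef(z0, rowvar=False)
    # 4. sample correlated normals and map them back to (0, 1)
    z1 = rng.multivariate_normal(np.zeros(k), corr, size=n)
    u1 = _cdf(z1)
    # 5. push each column through the empirical quantile function of d0
    return np.column_stack([np.quantile(d0[:, j], u1[:, j]) for j in range(k)])

rng = np.random.default_rng(0)
F = np.array([
    [1, 2, 3],
    [2, 1, 4],
    [5, 1, 3],
])
d0 = rng.uniform(5, 10, size=(5_000, 3)) @ F.T
d1 = gen_like_copula(d0, 5_000, rng)
```

Because step 5 interpolates between observed values, the generated columns never leave the range of the original columns, which a plain normal model cannot promise.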
So, for the first two moments only, assuming your input data is d0:
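import numpy as np
from sklearn.decomposition import PCA

def gen_like(d0, n):
    # fit a full-rank PCA: keeps every component, so no information is dropped
    pca = PCA(n_components=d0.shape[1]).fit(d0)
    z0 = pca.transform(d0)  # z0 is centered and uncorrelated (cov is diagonal)
    # independent normals, scaled to the spread of each principal component
    z1 = np.random.normal(size=(n, d0.shape[1])) * np.std(z0, 0)
    # project back to input space (rotates, then adds the mean back)
    d1 = pca.inverse_transform(z1)
    return d1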
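If it helps to see the mechanics without scikit-learn, the same idea can be sketched with a plain eigendecomposition of the covariance matrix. The name gen_like_eigh and the explicit rng argument are hypothetical, introduced here only for illustration:

```python
import numpy as np

def gen_like_eigh(d0, n, rng):
    """Hypothetical NumPy-only variant of gen_like()."""
    mu = d0.mean(axis=0)
    cov = np.cov(d0, rowvar=False)
    # eigenvectors are the principal axes; eigenvalues are the variances along them
    evals, evecs = np.linalg.eigh(cov)
    evals = np.clip(evals, 0, None)  # guard against tiny negative round-off
    # independent normals, scaled per axis (the "scaled" part)
    z1 = rng.standard_normal((n, d0.shape[1])) * np.sqrt(evals)
    # rotate back to the input axes and add the mean (the "rotated, translated" part)
    return z1 @ evecs.T + mu

rng = np.random.default_rng(0)
F = np.array([
    [1, 2, 3],
    [2, 1, 4],
    [5, 1, 3],
])
d0 = rng.normal(2, 4, size=(20_000, 3)) @ F.T
d1 = gen_like_eigh(d0, 20_000, rng)
```

If cov = V diag(w) V.T, then z1 has covariance diag(w), and z1 @ V.T has covariance V diag(w) V.T, which is cov again.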
Example:
# generate some random data
# arbitrary transformation matrix
F = np.array([
    [1, 2, 3],
    [2, 1, 4],
    [5, 1, 3],
])
d0 = np.random.normal(2, 4, size=(10000, 3)) @ F.T
np.mean(d0, 0)
# ex: array([12.12791066, 14.10333273, 17.95212292])
np.cov(d0.T)
# ex: array([[225.09691912, 257.39878551, 259.40288019],
#            [257.39878551, 338.34087242, 373.4773562 ],
#            [259.40288019, 373.4773562 , 566.29288861]])
# try to match the mean and covariance of d0
d1 = gen_like(d0, 10000)
np.allclose(np.mean(d0, 0), np.mean(d1, 0), rtol=0.1)
# often True (but not guaranteed)
np.allclose(np.cov(d0.T), np.cov(d1.T), rtol=0.1)
# often True (but not guaranteed)
What's funny is that you can fit a square peg in a round hole. The following starts from uniform (not normal) data and shows that only the mean and covariance are matched, not the higher moments:
d0 = np.random.uniform(5, 10, size=(1000, 3)) @ F.T
d1 = gen_like(d0, 10000)
np.allclose(np.mean(d0, 0), np.mean(d1, 0), rtol=0.1)
# often True (but not guaranteed)
np.allclose(np.cov(d0.T), np.cov(d1.T), rtol=0.1)
# often True (but not guaranteed)
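To actually see that the higher moments differ, one option is to compare the excess kurtosis of each column, which is 0 for a normal distribution and negative for flat-topped distributions such as sums of uniforms. This is only a sketch: excess_kurtosis is a helper I made up for the purpose, and I generate d1 with multivariate_normal (see the update at the top) so that the snippet stands on its own:

```python
import numpy as np

def excess_kurtosis(x):
    """Hypothetical helper: per-column excess kurtosis (0 for a normal)."""
    c = x - x.mean(axis=0)
    return (c ** 4).mean(axis=0) / (c ** 2).mean(axis=0) ** 2 - 3.0

rng = np.random.default_rng(0)
F = np.array([
    [1, 2, 3],
    [2, 1, 4],
    [5, 1, 3],
])
d0 = rng.uniform(5, 10, size=(20_000, 3)) @ F.T
d1 = rng.multivariate_normal(d0.mean(axis=0), np.cov(d0, rowvar=False), size=20_000)

k0 = excess_kurtosis(d0)  # clearly negative: mixtures of uniforms are flat-topped
k1 = excess_kurtosis(d1)  # close to 0: the generated data is normal
```

The means and covariances agree, yet k0 and k1 do not, which is exactly the square peg in the round hole.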