
In Java, I usually rely on the org.apache.commons.math3.random.EmpiricalDistribution class to do the following:

  • Derive a probability distribution from observed data.
  • Generate random values from this distribution.

Is there any Python library that provides the same functionality? It seems like scipy.stats.gaussian_kde.resample does something similar, but I'm not sure if it implements the same procedure as the Java type I'm familiar with.

Carlos Gavidia-Calderon
  • I think the accepted answer [here](http://stackoverflow.com/questions/485076/does-anyone-have-example-code-of-using-scipy-stats-distributions/485233#485233) has what you're looking for. – Kevin Feb 16 '16 at 14:42
  • @Kevin: the linked answer doesn't work for this case, because it assumes you already know the analytical form of your distribution, whereas this question is looking for something non-parametric. – abeboparebop Aug 07 '18 at 13:46

1 Answer

import numpy as np
import scipy.stats
import matplotlib.pyplot as plt

# This represents the original "empirical" sample -- I fake it by
# sampling from a normal distribution
orig_sample_data = np.random.normal(size=10000)

# Generate a KDE from the empirical sample
sample_pdf = scipy.stats.gaussian_kde(orig_sample_data)

# Sample new datapoints from the KDE
new_sample_data = sample_pdf.resample(10000).T[:,0]

# Histogram of initial empirical sample
cnts, bins, p = plt.hist(orig_sample_data, label='original sample', bins=100,
                         histtype='step', linewidth=1.5, density=True)

# Histogram of datapoints sampled from KDE
plt.hist(new_sample_data, label='sample from KDE', bins=bins,
         histtype='step', linewidth=1.5, density=True)

# Visualize the kde itself
y_kde = sample_pdf(bins)
plt.plot(bins, y_kde, label='KDE')
plt.legend()
plt.show(block=False)

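[Figure: resulting plot showing step histograms of the original sample and the sample from the KDE, with the KDE curve overlaid]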

new_sample_data should be drawn from roughly the same distribution as the original data (to the degree that the KDE is a good approximation to the original distribution).
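One way to check that claim numerically, rather than by eye, is a two-sample Kolmogorov-Smirnov test. The following is only a sketch under assumptions that are not in the original answer: the fixed seed, the use of `np.random.default_rng`, and the choice of `scipy.stats.ks_2samp` as the comparison are mine.

```python
import numpy as np
import scipy.stats

# Assumption: a seeded generator, used only to make the sketch reproducible
rng = np.random.default_rng(0)

# Same setup as above: a fake "empirical" sample and a KDE fitted to it
orig_sample_data = rng.normal(size=10000)
sample_pdf = scipy.stats.gaussian_kde(orig_sample_data)

# resample() returns an array of shape (1, n) for 1-D data; take the first row
new_sample_data = sample_pdf.resample(10000, seed=rng)[0]

# Two-sample KS test: the statistic is the largest gap between the two
# empirical CDFs, so a small value means the samples look alike
result = scipy.stats.ks_2samp(orig_sample_data, new_sample_data)
print(f"KS statistic: {result.statistic:.4f}, p-value: {result.pvalue:.4f}")
```

A small KS statistic only says the two samples are hard to tell apart; the KDE still widens the distribution slightly, by an amount set by its bandwidth.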

jonespm
abeboparebop
  • This is not the correct way to draw a random sample representing the original distribution. A proper method would be some kind of CDF transform. – Zanam May 22 '19 at 15:10
  • @Zanam: what kinds of problems would you expect from this method? I'm no stats expert, so I'm genuinely curious. – abeboparebop May 22 '19 at 15:57
  • Just the fact that empirical distributions usually don't fit any standard distribution that we know of. – Zanam May 22 '19 at 16:03
  • @Zanam: the only assumption I've made is that the original data distribution can be reasonably fit by a Gaussian-smoothed KDE -- I'm not assuming that it follows any particular standard distribution. – abeboparebop May 27 '19 at 09:04
  • @Zanam can you please elaborate on the point you are trying to make? – Vlad Apr 16 '21 at 15:35
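
The comments above mention "some kind of CDF transform" without saying which one. One possible reading is inverse-transform sampling from the empirical CDF: draw uniform numbers in [0, 1] and map them through the sample quantile function. The sketch below is an assumption about what was meant, not the commenter's method; the helper name `sample_from_ecdf` and the seed are hypothetical.

```python
import numpy as np

# Assumption: a seeded generator, used only to make the sketch reproducible
rng = np.random.default_rng(0)

# Fake "empirical" sample, as in the answer
orig_sample_data = rng.normal(size=10000)


def sample_from_ecdf(data, size, rng):
    """Draw `size` values by inverse-transform sampling from the empirical CDF.

    Hypothetical helper: uniform draws are mapped through the sample
    quantile function, with np.quantile's default linear interpolation
    between neighbouring order statistics.
    """
    u = rng.uniform(0.0, 1.0, size=size)
    return np.quantile(data, u)


new_sample_data = sample_from_ecdf(orig_sample_data, 10000, rng)
```

Unlike the KDE approach, this never produces values outside the range of the observed data, and it needs no bandwidth choice. Whether that is preferable depends on how much smoothing beyond the observed sample is wanted.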