2

I've read that using the statsmodels.nonparametric.kde module instead of scipy.stats.gaussian_kde can lead to a substantial speed increase.

I have a simple block of code (4 lines of code) that I currently calculate making use of scipy.stats.gaussian_kde and I'd like to replace those for its equivalent in statsmodels to see if I can actually get an improvement in speed.

This is the MWE:

import numpy as np
from scipy import stats

# Generate random two-dimensional data.
def measure(n):
    m1 = np.random.normal(size=n)
    m2 = np.random.normal(scale=0.5, size=n)
    return m1+m2, m1-m2
m1, m2 = measure(20000)

# Define data limits.
xmin, xmax = m1.min(), m1.max()
ymin, ymax = m2.min(), m2.max()

# Format data correctly.
values = np.vstack([m1, m2])

# Define a certain point value.
x1, y1 = 0.5, 0.5

##############
# Replace with calls to statsmodels.nonparametric.kde from here on.

# 1- Perform a kernel density estimate on the data.
kernel = stats.gaussian_kde(values)

# 2- Get kernel value for the point.
iso = kernel((x1,y1))

# 3- Take a random sample from KDE distribution.
sample = kernel.resample(size=1000)

# 4- Filter the sample to keep only values for which
#    the kernel evaluates to less than what it does in the
#    point (x1,y1). This is the most important step to be replaced.
insample = kernel(sample) < iso

As can be seen, it's only 4 lines of code that need replacing. Unfortunately the documentation for statsmodels.nonparametric.kde is somewhat poor and I can't figure out how to make such replacements.

The last line is the most important one since it is here that most of the computation time is spent (as stated here Speed up sampling of kernel estimate).

Community
  • 1
  • 1
Gabriel
  • 40,504
  • 73
  • 230
  • 404
  • 1
    Unfortunately, statsmodels will not help. The fast fft version in statsmodels is only for univariate KDE (and currently doesn't have a resample method). The multivariate KDE in statsmodels will be slower than scipy's gaussian_kde, I guess. – Josef Jan 07 '14 at 19:18
  • If your target is integration, then scipy's gaussian_kde has a `integrate_box` method that might be reasonably fast (in Fortran). But, I never tried. – Josef Jan 07 '14 at 19:25
  • Unfortunately `integrate_box` is of no use to me since the integration area is composed of those points which evaluate to less than `kernel(x1,y1)` which is not easily definable. – Gabriel Jan 07 '14 at 20:03

0 Answers0