
I am trying to calculate the kernel-density estimate of a fairly large two-dimensional dataset to colour the points in a scatter plot. The function scipy.stats.gaussian_kde takes a long time, so I figured I could use dask (v0.15.2) to get the result faster. However, I am unsure whether my approach actually gets any speed-up. Here is an example:

import numpy as np
from scipy.stats import gaussian_kde
import dask.bag as db

xy = np.random.rand(2, 1000000)
kde = gaussian_kde(xy)

chunker = (xy[:, i*10000:(i+1)*10000] for i in range(100))  # 100 non-overlapping chunks of 10,000 points

compute_job = db.from_sequence(chunker).map(kde)

results = compute_job.compute()
z = np.hstack(results)

This takes over 60 hours to complete on a quad-core Xeon E5-2609 @ 2.4 GHz with a dataset of 2,677,920 coordinate pairs. Am I using dask correctly?

Pablo

1 Answer


Unfortunately, Dask does not offer speed-ups in all cases. In fact, if you run the KDE on just one of your input chunks, you will find that it is already using multiple cores, so Dask has no spare capacity to pick up.
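
As a quick check, here is a minimal timing sketch (the chunk size follows the question; watch a CPU monitor while it runs, and treat the printed figures as rough estimates only):

import time
import numpy as np
from scipy.stats import gaussian_kde

xy = np.random.rand(2, 1000000)
kde = gaussian_kde(xy)
chunk = xy[:, :10000]                    # one chunk, same size as in the question

start = time.perf_counter()
_ = kde(chunk)                           # evaluate the density at the chunk's points
elapsed = time.perf_counter() - start

# Extrapolate: 100 such chunks cover the full 1,000,000 points.
print("one chunk: %.1f s, estimated total: %.1f h" % (elapsed, elapsed * 100 / 3600))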

Doing a KDE (like a convolution) with a kernel of size 2×1,000,000 seems unwise; I am not surprised that it is taking very long. Are you sure this is what you want to do?
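
For a sense of scale, a back-of-the-envelope sketch using the dataset size from the question: gaussian_kde evaluates a Gaussian term between every fitted point and every query point, so evaluating the full dataset against itself means roughly n² kernel evaluations.

n = 2677920                    # coordinate pairs in the full dataset
pairwise_terms = n * n         # each point is both a fitted point and a query point
print("%.1e kernel evaluations" % pairwise_terms)   # roughly 7.2e+12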

Furthermore, may I take this opportunity to suggest datashader, which works chunk-wise with Dask arrays and includes nice blurring pipeline elements.
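
A rough sketch of the datashader route, assuming a current datashader installation; the canvas size, colour map choice and column names are illustrative:

import datashader as ds
import datashader.transfer_functions as tf
import numpy as np
import pandas as pd

xy = np.random.rand(2, 1000000)
df = pd.DataFrame({'x': xy[0], 'y': xy[1]})

canvas = ds.Canvas(plot_width=600, plot_height=600)
agg = canvas.points(df, 'x', 'y')     # per-pixel point counts
img = tf.shade(agg, how='log')        # map counts to colours
img = tf.spread(img, px=1)            # optional smoothing ("blur")

If the data do not fit in memory, canvas.points should also accept a dask DataFrame, in which case the aggregation runs chunk-wise.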

mdurant
  • Thank you for your reply. The reason I want to calculate the KDE is to colour the scatter plot. I would use datashader, but [unfortunately it is not yet fully supported in matplotlib](https://github.com/bokeh/datashader/pull/200). – Pablo Dec 13 '17 at 14:54
  • I am not questioning that you want some sort of KDE; I am questioning whether you want to compute it with a kernel much, much bigger than any of your data chunks. You are essentially only doing a [hexbin](https://matplotlib.org/examples/pylab_examples/hexbin_demo.html) with blur. – mdurant Dec 13 '17 at 16:09
  • I have to admit that I'm just following [the "recipe"](https://stackoverflow.com/a/20107592/1534504) blindly, with little understanding of the underlying statistics. I agree, a hexbin is a more appropriate solution for this amount of data. Thanks for the guidance. – Pablo Dec 15 '17 at 11:24
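
For completeness, a minimal hexbin sketch along the lines suggested in the comments above (the gridsize, colour map and mincnt are illustrative choices):

import matplotlib.pyplot as plt
import numpy as np

xy = np.random.rand(2, 1000000)

plt.hexbin(xy[0], xy[1], gridsize=200, cmap='viridis', mincnt=1)
plt.colorbar(label='points per bin')
plt.show()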