I am trying to calculate the kernel-density estimate of a fairly large two-dimensional dataset to colour the points in a scatter plot. The function scipy.stats.gaussian_kde takes a long time, so I figured I could use dask (v0.15.2) to get the result faster. However, I am not sure my approach actually gains any speed-up. Here is an example:
import numpy as np
from scipy.stats import gaussian_kde
import dask.bag as db

xy = np.random.rand(2, 1000000)
kde = gaussian_kde(xy)

# Split the 1,000,000 points into 100 consecutive chunks of 10,000 columns each.
chunker = (xy[:, i:i + 10000] for i in range(0, xy.shape[1], 10000))

# Evaluate the KDE on each chunk in parallel and stitch the results back together.
compute_job = db.from_sequence(chunker).map(kde)
results = compute_job.compute()
z = np.hstack(results)
This takes over 60 hours to complete on a quad-core Xeon E5-2609 @ 2.4 GHz with a dataset of 2,677,920 coordinate pairs. Am I using dask correctly?
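For comparison, here is a minimal sketch of the purely serial evaluation I would expect the dask version to beat; the chunking mirrors the example above, and the name z_serial is just for illustration:

import numpy as np
from scipy.stats import gaussian_kde

xy = np.random.rand(2, 1000000)
kde = gaussian_kde(xy)

# Serial equivalent of the dask pipeline: evaluate the KDE on each
# 10,000-point chunk in a plain Python loop and stitch the results together.
chunks = (xy[:, i:i + 10000] for i in range(0, xy.shape[1], 10000))
z_serial = np.hstack([kde(chunk) for chunk in chunks])

Both versions should produce the same values, so any difference in wall time should show whether dask is actually helping here.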