
When I try to make a scatter plot, colored by density, it takes forever.

Probably because the data set is quite large.

This is basically how I do it:

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

xy = np.vstack([np.array(x_values), np.array(y_values)])
z = gaussian_kde(xy)(xy)
plt.scatter(np.array(x_values), np.array(y_values), c=z, s=100, edgecolor='')

As additional info, I should add that:

>>> len(x_values)
809649

>>> len(y_values)
809649

Is there any other option to get the same result, but faster?

codeKiller
  • Have you tested whether it is the `scatter` function itself that is slow, or that the slowness happens when you run `plt.show` or `plt.savefig`? –  Jan 27 '15 at 15:34
  • The title is misleading. You are doing a `KDE` for a large data set. – cel Jan 27 '15 at 15:35
  • 1
    totally right cel, the slowness happens in `z = gaussian_kde(xy)(xy) ` I change it – codeKiller Jan 27 '15 at 15:37
  • 3
    @newPyUser what did you use instead of `gaussian_kde`? You said you would have changed it. – FaCoffee Nov 17 '16 at 17:54
  • Lower the bandwidth of the KDE, use a faster kernel (e.g. linear) and don't plot 80000 points with a scatterplot. – komodovaran_ Sep 04 '18 at 08:50
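The comments above point at the `gaussian_kde` call as the bottleneck rather than `scatter` itself. As a hedged illustration (not code from the question or its comments), one common way to get a density-coloured scatter much faster is to approximate the density with a 2D histogram instead of a kernel density estimate; the bin count and the synthetic stand-in data below are arbitrary assumptions.

import numpy as np
import matplotlib.pyplot as plt

# Stand-in data; in the question these would be x_values / y_values.
rng = np.random.default_rng(0)
x = rng.normal(size=800_000)
y = x + rng.normal(scale=0.5, size=800_000)

# Approximate the point density with a 2D histogram: binning ~800k points
# is roughly O(n), unlike the pairwise kernel evaluation gaussian_kde does.
bins = 200  # arbitrary resolution; finer bins look closer to a true KDE
counts, x_edges, y_edges = np.histogram2d(x, y, bins=bins)

# Find the bin each point falls into and use that bin's count as its colour.
x_idx = np.clip(np.digitize(x, x_edges[1:-1]), 0, bins - 1)
y_idx = np.clip(np.digitize(y, y_edges[1:-1]), 0, bins - 1)
z = counts[x_idx, y_idx]

plt.scatter(x, y, c=z, s=100, edgecolor='none')
plt.show()

Drawing ~800k markers is still slow in matplotlib itself; the answers below deal with that part by reducing or sampling the data, or by switching to a heatmap.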

2 Answers


No, there are no good solutions.

Every point has to be prepared and a circle drawn, which will then probably be hidden by other points anyway.

My tricks (note: these may change the output slightly):

  • Get the minimum and maximum of the data and fix the figure's axis limits to that range, so that the figure does not need to be rescaled.

  • Remove as much data as possible:

    • remove duplicate data;

    • convert the floats to a chosen precision and then remove the duplicates this creates. You can derive the precision from half the dot size (or from the resolution of the graph, if you want the original look). A sketch of this follows after the answer.

    Less data means more speed: removing a point is far quicker than drawing it in the graph (where it would be overwritten anyway).

  • Often heatmaps are more interesting for huge data sets: they give more information. But in your case, I think you still have too much data.

Note: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.gaussian_kde.html#scipy.stats.gaussian_kde also has a nice example (with just 2000 points). In any case, that page also applies my first point.
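A minimal sketch of the rounding/deduplication trick described above (my illustration, not the answerer's code): fix the axis limits first, round the coordinates to a chosen precision, drop duplicates with `np.unique`, and only then run the KDE and the scatter on the reduced set. The stand-in data and the 1-decimal precision are arbitrary assumptions.

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

# Stand-in data; in the question these would be x_values / y_values.
rng = np.random.default_rng(0)
x = rng.normal(size=800_000)
y = x + rng.normal(scale=0.5, size=800_000)

# First trick: fix the axis limits up front so the figure is never rescaled.
plt.xlim(x.min(), x.max())
plt.ylim(y.min(), y.max())

# Second trick: round to a chosen precision (1 decimal here, an arbitrary
# stand-in for "half the dot size") and drop the duplicates this creates.
xy = np.round(np.column_stack([x, y]), decimals=1)
xy_unique = np.unique(xy, axis=0)

# The KDE and the scatter now only see the much smaller deduplicated set.
z = gaussian_kde(xy_unique.T)(xy_unique.T)
plt.scatter(xy_unique[:, 0], xy_unique[:, 1], c=z, s=100, edgecolor='none')
plt.show()

For the heatmap suggestion, `plt.hist2d(x, y, bins=200)` or `plt.hexbin(x, y, gridsize=200)` would be the usual matplotlib calls.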

Giacomo Catenazzi

I would suggest plotting a sample of the data. If the sample is large enough, you should get the same distribution. It is also easy to check that the plot is representative of the entire data set: simply take multiple samples and compare them.
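A hedged sketch of the sampling idea (my illustration, not the answerer's code): draw a random subset, run the KDE only on that subset, and repeat with another seed to check that the picture is stable. The stand-in data and the sample size of 10,000 are arbitrary assumptions.

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

# Stand-in data; in the question these would be x_values / y_values.
rng = np.random.default_rng(0)
x = rng.normal(size=800_000)
y = x + rng.normal(scale=0.5, size=800_000)

# Draw a random sample without replacement; 10,000 points keeps
# gaussian_kde fast while preserving the shape of the distribution.
idx = rng.choice(x.size, size=10_000, replace=False)
xs, ys = x[idx], y[idx]

xy = np.vstack([xs, ys])
z = gaussian_kde(xy)(xy)
plt.scatter(xs, ys, c=z, s=100, edgecolor='none')
plt.show()

# Re-running with a different seed (e.g. np.random.default_rng(1)) and
# comparing the plots is a quick check that the sample is representative.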