17

I have a massive scatterplot (~100,000 points) that I'm generating in matplotlib. Each point has a location in this x/y space, and I'd like to generate contours containing certain percentiles of the total number of points.

Is there a function in matplotlib which will do this? I've looked into contour(), but I'd have to write my own function to work in this way.

Thanks!

Joe Kington
  • 275,208
  • 71
  • 604
  • 463
astromax
  • 6,001
  • 10
  • 36
  • 47

2 Answers2

52

Basically, you're wanting a density estimate of some sort. There multiple ways to do this:

  1. Use a 2D histogram of some sort (e.g. matplotlib.pyplot.hist2d or matplotlib.pyplot.hexbin) (You could also display the results as contours--just use numpy.histogram2d and then contour the resulting array.)

  2. Make a kernel-density estimate (KDE) and contour the results. A KDE is essentially a smoothed histogram. Instead of a point falling into a particular bin, it adds a weight to surrounding bins (usually in the shape of a gaussian "bell curve").

Using a 2D histogram is simple and easy to understand, but fundementally gives "blocky" results.

There are some wrinkles to doing the second one "correctly" (i.e. there's no one correct way). I won't go into the details here, but if you want to interpret the results statistically, you need to read up on it (particularly the bandwidth selection).

At any rate, here's an example of the differences. I'm going to plot each one similarly, so I won't use contours, but you could just as easily plot the 2D histogram or gaussian KDE using a contour plot:

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import kde

np.random.seed(1977)

# Generate 200 correlated x,y points
data = np.random.multivariate_normal([0, 0], [[1, 0.5], [0.5, 3]], 200)
x, y = data.T

nbins = 20

fig, axes = plt.subplots(ncols=2, nrows=2, sharex=True, sharey=True)

axes[0, 0].set_title('Scatterplot')
axes[0, 0].plot(x, y, 'ko')

axes[0, 1].set_title('Hexbin plot')
axes[0, 1].hexbin(x, y, gridsize=nbins)

axes[1, 0].set_title('2D Histogram')
axes[1, 0].hist2d(x, y, bins=nbins)

# Evaluate a gaussian kde on a regular grid of nbins x nbins over data extents
k = kde.gaussian_kde(data.T)
xi, yi = np.mgrid[x.min():x.max():nbins*1j, y.min():y.max():nbins*1j]
zi = k(np.vstack([xi.flatten(), yi.flatten()]))

axes[1, 1].set_title('Gaussian KDE')
axes[1, 1].pcolormesh(xi, yi, zi.reshape(xi.shape))

fig.tight_layout()
plt.show()

enter image description here

One caveat: With very large numbers of points, scipy.stats.gaussian_kde will become very slow. It's fairly easy to speed it up by making an approximation--just take the 2D histogram and blur it with a guassian filter of the right radius and covariance. I can give an example if you'd like.

One other caveat: If you're doing this in a non-cartesian coordinate system, none of these methods apply! Getting density estimates on a spherical shell is a bit more complicated.

Joe Kington
  • 275,208
  • 71
  • 604
  • 463
  • 1
    This is an excellent response! My only question is now that I have a method of binning the data, how do I plot certain percentages? Do I adjust the contour levels to reflect the percentages? It's kind of like a confidence interval. – astromax Oct 15 '13 at 21:28
  • 2
    Sorry for the delay! Basically, yes, you should adjust the contour levels to reflect the percentages. The `gaussian_kde` results are an estimate of the probability density function (PDF). Therefore, contouring a value of 0.1 would imply that 90% of the data is inside the contour, etc. For the 2D histogram, the values are raw counts, so you'd need to normalize. Hopefully that helps clarify things a bit. – Joe Kington Oct 16 '13 at 02:07
  • @JoeKington that's cool. But If I got a 3D random-dataset(x,y,z), then will it be possible to apply this method ? – diffracteD Jun 10 '15 at 07:34
  • I am really late to this, but I am curious if you still have an example of code that approximates the KDE using a blur. – GWW Oct 07 '15 at 22:43
  • 1
    @GWW - Have a look at the `fast_kde` function here: https://gist.github.com/joferkington/d95101a61a02e0ba63e5 – Joe Kington Oct 13 '15 at 14:20
2

I have the same question. If you want to plot contours, which contain some part of points you can use following algorithm:

create 2d histogram

h2, xedges, yedges = np.histogram2d(X, Y, bibs = [30, 30])

h2 is now 2d matrix containing integers which is number of points in some rectangle

hravel = np.sort(np.ravel(h2))[-1] #all possible cases for rectangles 
hcumsum = np.sumsum(hravel)

ugly hack,

let give for every point in h2 2d matrix the cumulative number of points for rectangle which contain number of points equal or greater to that we analyze currently.

hunique = np.unique(hravel)

hsum = np.sum(h2)

for h in hunique:
    h2[h2 == h] = hcumsum[np.argwhere(hravel == h)[-1]]/hsum

now plot contour for h2, it will be the contour which containing some amount of all points

andrey
  • 21
  • 1