Efficiently create a density plot for high-density regions, points for sparse regions

Question

I need to make a plot that functions like a density plot for high-density regions on the plot, but below some threshold uses individual points. I couldn't find any existing code that looked similar to what I need in the matplotlib thumbnail gallery or from google searches. I have a working code I wrote myself, but it is somewhat tricky and (more importantly) takes an unacceptably long time when the number of points/bins is large. Here is the code:

import numpy as np
import math
import matplotlib as mpl
import matplotlib.pyplot as plt
import pylab
import numpy.random

#Create the colormap:
halfpurples = {'blue': [(0.0,1.0,1.0),(0.000001, 0.78431373834609985, 0.78431373834609985),
(0.25, 0.729411780834198, 0.729411780834198), (0.5,
0.63921570777893066, 0.63921570777893066), (0.75,
0.56078433990478516, 0.56078433990478516), (1.0, 0.49019607901573181,
0.49019607901573181)],

    'green': [(0.0,1.0,1.0),(0.000001,
    0.60392159223556519, 0.60392159223556519), (0.25,
    0.49019607901573181, 0.49019607901573181), (0.5,
    0.31764706969261169, 0.31764706969261169), (0.75,
    0.15294118225574493, 0.15294118225574493), (1.0, 0.0, 0.0)],

    'red': [(0.0,1.0,1.0),(0.000001,
    0.61960786581039429, 0.61960786581039429), (0.25,
    0.50196081399917603, 0.50196081399917603), (0.5,
    0.41568627953529358, 0.41568627953529358), (0.75,
    0.32941177487373352, 0.32941177487373352), (1.0,
    0.24705882370471954, 0.24705882370471954)]} 

halfpurplecmap = mpl.colors.LinearSegmentedColormap('halfpurples',halfpurples,256)

#Create x,y arrays of normally distributed points
npts = 1000
x = numpy.random.standard_normal(npts)
y = numpy.random.standard_normal(npts)

#Set bin numbers in both axes
nxbins = 25
nybins = 25

#Set the cutoff for resolving the individual points
minperbin = 1

#Make the density histrogram
H, yedges, xedges = np.histogram2d(y,x,bins=(nybins,nxbins))
#Reorient the axes
H =  H[::-1]

extent = [xedges[0],xedges[-1],yedges[0],yedges[-1]]

#Compute all bins where the density plot value is below (or equal to) the threshold
lowxleftedges = [[xedges[i] for j in range(len(H[:,i])) if H[j,i] <= minperbin] for i in range(len(H[0,:]))] 
lowxrightedges = [[xedges[i+1] for j in range(len(H[:,i])) if H[j,i] <= minperbin] for i in range(len(H[0,:]))] 
lowyleftedges = [[yedges[-(j+2)] for j in range(len(H[:,i])) if H[j,i] <= minperbin] for i in range(len(H[0,:]))]
lowyrightedges = [[yedges[-(j+1)] for j in range(len(H[:,i])) if H[j,i] <= minperbin] for i in range(len(H[0,:]))]

#Flatten and convert to numpy array
lowxleftedges = np.asarray([item for sublist in lowxleftedges for item in sublist])
lowxrightedges = np.asarray([item for sublist in lowxrightedges for item in sublist])
lowyleftedges = np.asarray([item for sublist in lowyleftedges for item in sublist])
lowyrightedges = np.asarray([item for sublist in lowyrightedges for item in sublist])

#Find all points that lie in these regions
lowdatax = [[x[i] for j in range(len(lowxleftedges)) if lowxleftedges[j] <= x[i] and x[i] <= lowxrightedges[j] and lowyleftedges[j] <= y[i] and y[i] <= lowyrightedges[j]] for i in range(len(x))]
lowdatay = [[y[i] for j in range(len(lowyleftedges)) if lowxleftedges[j] <= x[i] and x[i] <= lowxrightedges[j] and lowyleftedges[j] <= y[i] and y[i] <= lowyrightedges[j]] for i in range(len(y))]

#Flatten and convert into numpy array
lowdatax = np.asarray([item for sublist in lowdatax for item in sublist])
lowdatay = np.asarray([item for sublist in lowdatay for item in sublist])

#Plot
fig1 = plt.figure()
ax1 = fig1.add_subplot(111)
ax1.plot(lowdatax,lowdatay,linestyle='.',marker='o',mfc='k',mec='k')
cp1 = ax1.imshow(H,interpolation='nearest',extent=extent,cmap=halfpurplecmap,vmin=minperbin)
fig1.colorbar(cp1)

fig1.savefig('contourtest.eps')

This code produces an image that looks like this:

countour test

However, when used on larger data sets the program takes several seconds to minutes. Any thoughts on how to speed this up? Thanks!

A few days ago my girlfriend showed me beautiful plots she has made with R's [`smoothScatter`](http://rfunction.com/archives/595) function, which advantageously combines a scatter plot and a density map. I became instantly frustrated that there was no equivalent in matplotlib, so I am glad to find this old question here about it. — juseg, Feb 05 '15 at 13:51

sega_sai · Accepted Answer · 2016-02-01T13:58:15.073

This should do it:

import matplotlib.pyplot as plt, numpy as np, numpy.random, scipy

#histogram definition
xyrange = [[-5,5],[-5,5]] # data range
bins = [100,100] # number of bins
thresh = 3  #density threshold

#data definition
N = 1e5;
xdat, ydat = np.random.normal(size=N), np.random.normal(1, 0.6, size=N)

# histogram the data
hh, locx, locy = scipy.histogram2d(xdat, ydat, range=xyrange, bins=bins)
posx = np.digitize(xdat, locx)
posy = np.digitize(ydat, locy)

#select points within the histogram
ind = (posx > 0) & (posx <= bins[0]) & (posy > 0) & (posy <= bins[1])
hhsub = hh[posx[ind] - 1, posy[ind] - 1] # values of the histogram where the points are
xdat1 = xdat[ind][hhsub < thresh] # low density points
ydat1 = ydat[ind][hhsub < thresh]
hh[hh < thresh] = np.nan # fill the areas with low density by NaNs

plt.imshow(np.flipud(hh.T),cmap='jet',extent=np.array(xyrange).flatten(), interpolation='none', origin='upper')
plt.colorbar()   
plt.plot(xdat1, ydat1, '.',color='darkblue')
plt.show()

Nice, that's the same idea as my eventual solution but expressed in fewer lines of code. Thanks! — Singularity, May 04 '12 at 18:45
Is there a way of doing the same thing, but with dynamical plot re-scale remaining? For instance in case where standard deviations are very different? — chiffa, Apr 06 '14 at 20:46

score 6 · Answer 2 · answered Feb 05 '15 at 13:50

For the record, here is the result of a new attempt using scipy.stats.gaussian_kde rather than a 2D histogram. One could envision different combinations of color meshing and contouring depending on the purpose.

import numpy as np
from matplotlib import pyplot as plt
from scipy.stats import gaussian_kde

# parameters
npts = 5000         # number of sample points
bins = 100          # number of bins in density maps
threshold = 0.01    # density threshold for scatter plot

# initialize figure
fig, ax = plt.subplots()

# create a random dataset
x1, y1 = np.random.multivariate_normal([0, 0], [[1, 0], [0, 1]], npts/2).T
x2, y2 = np.random.multivariate_normal([4, 4], [[4, 0], [0, 1]], npts/2).T
x = np.hstack((x1, x2))
y = np.hstack((y1, y2))
points = np.vstack([x, y])

# perform kernel density estimate
kde = gaussian_kde(points)
z = kde(points)

# mask points above density threshold
x = np.ma.masked_where(z > threshold, x)
y = np.ma.masked_where(z > threshold, y)

# plot unmasked points
ax.scatter(x, y, c='black', marker='.')

# get bounds from axes
xmin, xmax = ax.get_xlim()
ymin, ymax = ax.get_ylim()

# prepare grid for density map
xedges = np.linspace(xmin, xmax, bins)
yedges = np.linspace(ymin, ymax, bins)
xx, yy = np.meshgrid(xedges, yedges)
gridpoints = np.array([xx.ravel(), yy.ravel()])

# compute density map
zz = np.reshape(kde(gridpoints), xx.shape)

# plot density map
im = ax.imshow(zz, cmap='CMRmap_r', interpolation='nearest',
               origin='lower', extent=[xmin, xmax, ymin, ymax])

# plot threshold contour
cs = ax.contour(xx, yy, zz, levels=[threshold], colors='black')

# show
fig.colorbar(im)
plt.show()

Smooth scatter plot

score 2 · Answer 3 · answered May 04 '12 at 04:27

Your problem is quadratic - for npts = 1000, you have array size reaching 10^6 points, and than you iterate over these lists with list comprehensions.
Now, this is a matter of taste of course, but I find that list comprehension can yield a totally code which is hard to follow, and they are only slightly faster sometimes ... but that's not my point.
My point is that for large array operations you have numpy functions like:

np.where, np.choose etc.

See that you can achieve that functionality of the list comprehensions with NumPy, and your code should run faster.

Do I understand correctly, your comment ?

#Find all points that lie in these regions

are you testing for a point inside a polygon ? if so, consider point in polygon inside matplotlib.

score 2 · Answer 4 · answered May 04 '12 at 15:03

After a night to sleep on it and reading through Oz123's suggestions, I figured it out. The trick is to compute which bin each x,y point falls into (xi,yi), then test if H[xi,yi] (actually, in my case H[yi,xi]) is beneath the threshold. The code is below, and runs very fast for large numbers of points and is much cleaner:

import numpy as np
import math
import matplotlib as mpl
import matplotlib.pyplot as plt
import pylab
import numpy.random

#Create the colormap:
halfpurples = {'blue': [(0.0,1.0,1.0),(0.000001, 0.78431373834609985, 0.78431373834609985),
0.25, 0.729411780834198, 0.729411780834198), (0.5,
0.63921570777893066, 0.63921570777893066), (0.75,
0.56078433990478516, 0.56078433990478516), (1.0, 0.49019607901573181,
0.49019607901573181)],

    'green': [(0.0,1.0,1.0),(0.000001,
    0.60392159223556519, 0.60392159223556519), (0.25,
    0.49019607901573181, 0.49019607901573181), (0.5,
    0.31764706969261169, 0.31764706969261169), (0.75,
    0.15294118225574493, 0.15294118225574493), (1.0, 0.0, 0.0)],

    'red': [(0.0,1.0,1.0),(0.000001,
    0.61960786581039429, 0.61960786581039429), (0.25,
    0.50196081399917603, 0.50196081399917603), (0.5,
    0.41568627953529358, 0.41568627953529358), (0.75,
    0.32941177487373352, 0.32941177487373352), (1.0,
    0.24705882370471954, 0.24705882370471954)]} 

halfpurplecmap = mpl.colors.LinearSegmentedColormap('halfpurples',halfpurples,256)

#Create x,y arrays of normally distributed points
npts = 100000
x = numpy.random.standard_normal(npts)
y = numpy.random.standard_normal(npts)

#Set bin numbers in both axes
nxbins = 100
nybins = 100

#Set the cutoff for resolving the individual points
minperbin = 1

#Make the density histrogram
H, yedges, xedges = np.histogram2d(y,x,bins=(nybins,nxbins))
#Reorient the axes
H =  H[::-1]

extent = [xedges[0],xedges[-1],yedges[0],yedges[-1]]

#Figure out which bin each x,y point is in
xbinsize = xedges[1]-xedges[0]
ybinsize = yedges[1]-yedges[0]
xi = ((x-xedges[0])/xbinsize).astype(np.integer)
yi = nybins-1-((y-yedges[0])/ybinsize).astype(np.integer)

#Subtract one from any points exactly on the right and upper edges of the region
xim1 = xi-1
yim1 = yi-1
xi = np.where(xi < nxbins,xi,xim1)
yi = np.where(yi < nybins,yi,yim1)

#Get all points with density below the threshold
lowdensityx = x[H[yi,xi] <= minperbin]
lowdensityy = y[H[yi,xi] <= minperbin]

#Plot
fig1 = plt.figure()
ax1 = fig1.add_subplot(111)
ax1.plot(lowdensityx,lowdensityy,linestyle='.',marker='o',mfc='k',mec='k',ms=3)
cp1 = ax1.imshow(H,interpolation='nearest',extent=extent,cmap=halfpurplecmap,vmin=minperbin)
fig1.colorbar(cp1)

fig1.savefig('contourtest.eps')

i gave you an upvote for implementing my suggestion :-) try always work with numpy builtins , it's faster than list comprehensions — oz123, May 05 '12 at 18:17

Efficiently create a density plot for high-density regions, points for sparse regions

4 Answers4

Linked