2

I wrote some code a while ago that used gaussian kde to make simple density scatter plots. However, for datasets larger than about 100,000 points, it just ran 'forever' (I killed it after a few days). A friend gave me some code in R that could create such a density plot in seconds (plot_fun.R), and it seems like matplotlib should be able to do the same thing.

I think the right place to look is 2d histograms, but I am struggling to get the density to be 'right'. I modified code I found at this question to accomplish this, but the density is not showing, it looks like only the densist posible points are getting any color.

Here is approximately the code I am using:

# initial data
x = -np.log10(np.random.random_sample(10000))
y = -np.log10(np.random.random_sample(10000))

#histogram definition
bins = [1000, 1000] # number of bins
thresh = 3  #density threshold

#data definition
mn = min(x.min(), y.min())
mx = max(x.max(), y.max())
mn = mn-(mn*.1)
mx = mx+(mx*.1)
xyrange = [[mn, mx], [mn, mx]]

# histogram the data
hh, locx, locy = np.histogram2d(x, y, range=xyrange, bins=bins)
posx = np.digitize(x, locx)
posy = np.digitize(y, locy)

#select points within the histogram
ind = (posx > 0) & (posx <= bins[0]) & (posy > 0) & (posy <= bins[1])
hhsub = hh[posx[ind] - 1, posy[ind] - 1] # values of the histogram where the points are
xdat1 = x[ind][hhsub < thresh] # low density points
ydat1 = y[ind][hhsub < thresh]
hh[hh < thresh] = np.nan # fill the areas with low density by NaNs

f, a = plt.subplots(figsize=(12,12))

c = a.imshow(
    np.flipud(hh.T), cmap='jet',
    extent=np.array(xyrange).flatten(), interpolation='none',
    origin='upper'
)
f.colorbar(c, ax=ax, orientation='vertical', shrink=0.75, pad=0.05)
s = a.scatter(
    xdat1, ydat1, color='darkblue', edgecolor='', label=None,
    picker=True, zorder=2
) 

That produces this plot:

bad plot

The KDE code is here:

f, a = plt.subplots(figsize=(12,12))
xy = np.vstack([x, y])
z = sts.gaussian_kde(xy)(xy)                                           
# Sort the points by density, so that the densest points are       
# plotted last                                                     
idx = z.argsort()                                                  
x2, y2, z = x[idx], y[idx], z[idx]                                 
s = a.scatter(                                                     
    x2, y2, c=z, s=50, cmap='jet',
    edgecolor='', label=None, picker=True, zorder=2                
)   

That produces this plot:

good plot

The problem is, of course, that this code is unusable on large data sets.

My question is: how can I use the 2d histogram to produce a scatter plot like that? ax.hist2d does not produce a useful output, because it colors the whole plot, and all my efforts to get the above 2d histogram data to actually color the dense regions of the plot correctly have failed, I always end up with either no coloring or a tiny percentage of the densest points being colored. Clearly I just don't understand the code very well.

Mike D
  • 727
  • 2
  • 10
  • 26
  • Using smaller marker `','` along with transparency `alpha=0.1` (or any appropriate value) will give you similar visual info and should be very fast – Julien Apr 05 '18 at 02:05
  • I did, that is the first plot and set of code. My question is basically why it doesn't work, as you can see in the first plot, there is no density shading. – Mike D Apr 05 '18 at 17:07

1 Answers1

2

Your histogram code assigns a unique color (color='darkblue') so what are you expecting? I think you are also over complicating things. This much simpler code works fine:

import numpy as np
import matplotlib.pyplot as plt

x, y = -np.log10(np.random.random_sample((2,10**6)))

#histogram definition
bins = [1000, 1000] # number of bins

# histogram the data
hh, locx, locy = np.histogram2d(x, y, bins=bins)

# Sort the points by density, so that the densest points are plotted last
z = np.array([hh[np.argmax(a<=locx[1:]),np.argmax(b<=locy[1:])] for a,b in zip(x,y)])
idx = z.argsort()
x2, y2, z2 = x[idx], y[idx], z[idx]

plt.figure(1,figsize=(8,8)).clf()
s = plt.scatter(x2, y2, c=z2, cmap='jet', marker='.')  
Julien
  • 13,986
  • 5
  • 29
  • 53
  • Thanks. Unfortunately for me, this still results in only the densist points being colored, rather than a smooth gradient from most to least dense. Also, just FYI, the 'darkblue' color was for the scatter plot only, the histogram was added as an image before that with the jet cmap. – Mike D Apr 06 '18 at 17:16
  • From playing a little more I think that the histogram just isn't going to work, it doesn't seem to accurately portray density – Mike D Apr 06 '18 at 17:22
  • Just to make it clear, here is the result of the R code on real data (about 2 minutes run time): https://imgur.com/a/S6NZC, here is the result of the code you shared on the same data (similar run time): https://imgur.com/a/6RsdP. Because so many points are in a tiny region of the graph, nothing else is colored. – Mike D Apr 06 '18 at 17:54
  • @MikeDacre, I am 2 years late but if you want to see the distribution of data that is too dense, you should transform the scale to log. – aerijman Mar 21 '23 at 10:08