Scatter plot with a huge amount of data

Question

I would like to use Matplotlib to generate a scatter plot with a huge amount of data (about 3 million points). I have 3 vectors of the same length, and I currently plot them in the following way.

import matplotlib.pyplot as plt
import matplotlib.cm as cm

# delta, vf and dS are the three equal-length vectors (about 3 million values each)
fig = plt.figure()
fig.subplots_adjust(bottom=0.2)
ax = fig.add_subplot(111)
plt.scatter(delta,vf,c=dS,alpha=0.7,cmap=cm.Paired)

Nothing special. But it takes too long to generate the plot (I'm working on a MacBook Pro with 4 GB RAM, Python 2.7 and Matplotlib 1.0). Is there any way to improve the speed?

Beyond tens of thousands of points, some form of raster graphing might be preferable both for speed and actual usability. — Nick T, Nov 03 '10 at 15:57

unutbu · Answer 1 · 2010-11-03T02:15:05.087

Unless your graphic is huge, many of those 3 million points are going to overlap. (A 400x600 image only has 240K dots...)

So the easiest thing to do would be to take a sample of say, 1000 points, from your data:

import random
# random.sample expects a sequence, so convert the NumPy array to a list first
delta_sample=random.sample(list(delta),1000)

and just plot that.

For example:

import matplotlib.pyplot as plt
import matplotlib.cm as cm
import numpy as np
import random

fig = plt.figure()
fig.subplots_adjust(bottom=0.2)
ax = fig.add_subplot(111)

N=3*10**6
delta=np.random.normal(size=N)
vf=np.random.normal(size=N)
dS=np.random.normal(size=N)

idx=random.sample(range(N),1000)

plt.scatter(delta[idx],vf[idx],c=dS[idx],alpha=0.7,cmap=cm.Paired)
plt.show()

Or, if you need to pay more attention to outliers, then perhaps you could bin your data using np.histogram, and then compose a delta_sample which has representatives from each bin.

Unfortunately, when using np.histogram (or np.histogramdd) I don't think there is any easy way to associate bins with individual data points. A simple but approximate solution is to use the bin edge itself as a proxy for the points that fall in that bin:

xedges=np.linspace(-10,10,100)
yedges=np.linspace(-10,10,100)
zedges=np.linspace(-10,10,10)
hist,edges=np.histogramdd((delta,vf,dS), (xedges,yedges,zedges))
xidx,yidx,zidx=np.where(hist>0)
plt.scatter(xedges[xidx],yedges[yidx],c=zedges[zidx],alpha=0.7,cmap=cm.Paired)
plt.show()
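If the bin edge is too coarse a proxy, one way to keep real data points is to label every point with its bin using np.digitize and then keep one randomly chosen point per occupied bin. The following is only a sketch under assumptions: the 50 x 50 grid, the reduced N, and the synthetic normal data are illustrative choices, not values from the question.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 300_000  # smaller stand-in for the 3 million points
delta = rng.normal(size=N)
vf = rng.normal(size=N)
dS = rng.normal(size=N)

# Assumed grid: 50 x 50 bins spanning the data range.
nbins = 50
xedges = np.linspace(delta.min(), delta.max(), nbins + 1)
yedges = np.linspace(vf.min(), vf.max(), nbins + 1)

# np.digitize gives, for each point, the index of the bin it falls in.
# Clipping moves the point equal to the maximum into the last bin.
xbin = np.clip(np.digitize(delta, xedges) - 1, 0, nbins - 1)
ybin = np.clip(np.digitize(vf, yedges) - 1, 0, nbins - 1)
cell = xbin * nbins + ybin  # one integer label per 2D cell

# Shuffle first so the point kept for each cell is a random member of it.
order = rng.permutation(N)
_, first = np.unique(cell[order], return_index=True)
idx = order[first]  # one original point index per occupied cell
```

The arrays delta[idx], vf[idx] and dS[idx] stay aligned, so they can be passed to plt.scatter exactly as in the random-sample example. Sparse outer bins are always represented, which a plain random sample does not guarantee.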

To complete the solution, if you sample randomly, do it N times to get the whole picture of the situation. — Dat Chu, Nov 02 '10 at 21:58
Actually, I also think that binning the data could be the easiest way. Can you please suggest how to do it while conserving the correspondence between the bins of the three vectors? I mean a sort of 3D histogram. — Nicola Vianello, Nov 02 '10 at 22:29
Thank you very much. I think I did not explain myself correctly. I would like to create a colormap in which the color indicates the average value of the variable z in the bin (xbin, ybin), in order to plot it with imshow. I think this is different from what np.histogramdd gives. Maybe someone could help me. — Nicola Vianello, Nov 03 '10 at 13:18

score 12 · Answer 2 · answered Nov 10 '10 at 16:14

What about trying pyplot.hexbin? It generates a sort of heatmap based on point density in a set number of bins.
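As a rough illustration, hexbin can color by point count (the default) or, when the C argument is given, by a reduction of a third variable over each hexagon. The sketch below assumes synthetic normal data, a reduced N, and gridsize=60 purely for demonstration.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
N = 300_000  # smaller stand-in for the 3 million points
delta = rng.normal(size=N)
vf = rng.normal(size=N)
dS = rng.normal(size=N)

fig, (ax_count, ax_mean) = plt.subplots(1, 2, figsize=(10, 4))

# Default behaviour: color each hexagon by the number of points inside it.
# mincnt=1 hides hexagons that contain no points.
hb_count = ax_count.hexbin(delta, vf, gridsize=60, mincnt=1)
fig.colorbar(hb_count, ax=ax_count, label="points per bin")

# With C given, each hexagon is colored by reduce_C_function applied
# to the C values of the points that fall inside it (here: the mean of dS).
hb_mean = ax_mean.hexbin(delta, vf, C=dS, gridsize=60,
                         reduce_C_function=np.mean)
fig.colorbar(hb_mean, ax=ax_mean, label="mean dS per bin")

# plt.show()  # uncomment to display the figure interactively
```

Because only a few thousand hexagons are drawn instead of millions of markers, rendering time no longer grows with the number of points.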

conjectures

score 9 · Accepted Answer · edited May 23 '17 at 11:47

You could take the heatmap approach shown here. In this example the color represents the quantity of data in the bin, not the median value of the dS array, but that should be easy to change. More later if you are interested.
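One possible way to make the color show a per-bin value of dS instead of the count is to histogram twice, once weighted by dS and once unweighted, and divide. This sketch computes the mean rather than the median; the 100 x 100 grid, the reduced N, and the synthetic data (with dS made to depend on delta so the map is not flat) are assumptions for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
N = 300_000  # smaller stand-in for the 3 million points
delta = rng.normal(size=N)
vf = rng.normal(size=N)
dS = delta + rng.normal(scale=0.5, size=N)  # synthetic third variable

bins = 100
# Number of points per (delta, vf) bin.
counts, xedges, yedges = np.histogram2d(delta, vf, bins=bins)
# Sum of dS per bin, using the same edges.
sums, _, _ = np.histogram2d(delta, vf, bins=[xedges, yedges], weights=dS)

# Mean per bin; empty bins stay NaN so imshow leaves them blank.
mean_dS = np.full_like(sums, np.nan)
np.divide(sums, counts, out=mean_dS, where=counts > 0)

fig, ax = plt.subplots()
# histogram2d puts delta along axis 0, so transpose for imshow
# and use origin="lower" to keep vf increasing upwards.
im = ax.imshow(mean_dS.T, origin="lower", aspect="auto",
               extent=[xedges[0], xedges[-1], yedges[0], yedges[-1]])
fig.colorbar(im, ax=ax, label="mean dS per bin")
ax.set_xlabel("delta")
ax.set_ylabel("vf")

# plt.show()  # uncomment to display the figure interactively
```

A true median cannot be obtained from weighted sums; scipy.stats.binned_statistic_2d with statistic='median' is one option if SciPy is available.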

answered Nov 03 '10 at 15:51

Paul

But the heat map is not a good idea for anomaly detection with a scatter plot. – Ch HaXam Feb 02 '18 at 09:59
@ChHaXam Good point. You can, however, overlay a scatter plot (of outliers) on top of the heat map and get the best of both. – Paul Feb 03 '18 at 02:17
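A minimal sketch of that overlay, assuming a simple and arbitrary outlier rule: a point counts as an outlier if its bin holds fewer than min_count points. The rule, the threshold, the grid size and the synthetic data are all illustrative assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
N = 300_000  # smaller stand-in for the 3 million points
delta = rng.normal(size=N)
vf = rng.normal(size=N)

bins = 100
counts, xedges, yedges = np.histogram2d(delta, vf, bins=bins)

# Bin index of every point (clip so the maximum lands in the last bin).
ix = np.clip(np.digitize(delta, xedges) - 1, 0, bins - 1)
iy = np.clip(np.digitize(vf, yedges) - 1, 0, bins - 1)

# Assumed outlier rule: the point's bin holds fewer than min_count points.
min_count = 3
sparse = counts[ix, iy] < min_count

fig, ax = plt.subplots()
# Heat map of the dense bins only; sparse bins are masked out with NaN.
dense = np.where(counts >= min_count, counts, np.nan)
im = ax.imshow(dense.T, origin="lower", aspect="auto",
               extent=[xedges[0], xedges[-1], yedges[0], yedges[-1]])
fig.colorbar(im, ax=ax, label="points per bin")
# Individual markers for the comparatively few points in sparse bins.
ax.scatter(delta[sparse], vf[sparse], s=4, color="red", label="outliers")
ax.legend()

# plt.show()  # uncomment to display the figure interactively
```

Only the points in sparse bins are drawn as markers, so the scatter layer stays small even when the full data set has millions of points.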
