
I have a dataset with attributes x and y, which can be plotted in the x-y plane.

Originally, I used the code

import matplotlib.pyplot as plt

df.plot(kind='scatter', x='x', y='y', alpha=0.10, s=2)
plt.gca().set_aspect('equal')

The code is pretty quick with a dataset of about 50,000 points.

Recently, I switched to a newer dataset of about 2,500,000 points, and the scatter plot became much slower.

Is this expected behaviour, and is there anything I can do to improve the plotting speed?

ZK Zhao
  • It's obviously at least linear in the number of points. Depending on your marker size, display and dpi, I can't imagine plotting so many points makes any sense at all. So a natural approach: down-sample your huge data (plot only 10%, randomly selected; a minimal sketch of this follows these comments). This might be even more important when you start to output your plot in vector-graphics-based formats. **Edit:** I'm very sure the use of alpha is also making this very slow. If you want to emulate some density plot (by the use of alpha / shading), there are of course better approaches, but there is not enough information here – sascha Mar 07 '17 at 02:36
  • Another suggestion is that you can create a pixel-based picture can draw element. This significantly reduces the memory consumption and speed. – Yifan Sun Mar 07 '17 at 03:01
  • Is the question really *"Hey, I've increased the number of points by a factor of 50 and my plotting speed suddenly is 50 times slower. Can anyone tell me the reason?"* ? – ImportanceOfBeingErnest Mar 07 '17 at 08:01
  • @ImportanceOfBeingErnest I think so, and I want to know if it is possible to speed it up. – ZK Zhao Mar 07 '17 at 08:12
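
A minimal sketch of the random down-sampling suggested in the comments, assuming `df` is the DataFrame from the question with columns 'x' and 'y' (the 10% fraction is just an example):

import matplotlib.pyplot as plt

# Plot a random 10% subset instead of all ~2,500,000 points
# (assumes `df` is the pandas DataFrame from the question).
sample = df.sample(frac=0.10, random_state=0)
sample.plot(kind='scatter', x='x', y='y', alpha=0.10, s=2)
plt.gca().set_aspect('equal')
plt.show()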

2 Answers


Yes, this is expected. The reason is that a scatter plot of more than maybe a thousand points makes very little sense, so no one has bothered to optimise it. You will be better off using some other representation for your data:

  • A heatmap, if your points are distributed all over the place. Make the heatmap cells pretty small (see the sketch after this list).
  • Draw some sort of curve that approximates the distribution, for example by correlating your y with your x. Be sure to provide some confidence values or describe the distribution in some other way; for me, for instance, building a box-and-whiskers plot of y for every x (or range of x) and placing them on the same grid usually works pretty well.
  • Reduce your dataset. @sascha in the comments suggests random sampling, and that's definitely a good idea. Depending on your data, there may be a better way to choose representative points.
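
As a sketch of the heatmap idea (my own example, assuming the DataFrame `df` with columns 'x' and 'y' from the question): `plt.hexbin` (or `plt.hist2d`) bins the points once and draws a single image, so it stays fast even with millions of points.

import matplotlib.pyplot as plt

# Bin the points into small hexagonal cells and colour each cell by its count
# (assumes `df` is the pandas DataFrame from the question).
plt.hexbin(df['x'], df['y'], gridsize=300, cmap='viridis', mincnt=1)
plt.gca().set_aspect('equal')
plt.colorbar(label='points per cell')
plt.show()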
Synedraacus
  • Good recommendations. In regards to heatmap/grid-based approaches, [this discussion @ SO is also worthy](http://stackoverflow.com/questions/7470288/matplotlib-pcolor-very-slow-alternatives). – sascha Mar 07 '17 at 02:59
  • Yes. I also make other plots, such as density plots. It's just that when exploring the data, you want to plot it in a lot of ways. – ZK Zhao Mar 07 '17 at 07:31
  • How would you suggest plotting something like a map where the walls are all given by, let's say, 1,000,000 ladar data points? – Kyle Sep 14 '17 at 07:20
  • For an extremely crude solution I'd just create a numpy 2D array full of zeroes at the desired resolution and increment a value by one everywhere a wall is detected. Then I'd convert it into a grayscale image and display with something like `imshow` and/or export to an image file (a sketch of this follows these comments). This way, you use less than one pixel (and one numpy `int`) per data point. Surely whatever objects MPL uses in scatterplots are much heavier than `int` increments. – Synedraacus Sep 15 '17 at 08:05
  • But I'm pretty sure there are libraries developed and optimised specifically for ladar data (I guess you mean the optical radar thing?). Maybe you can use those to efficiently plot your data? – Synedraacus Sep 15 '17 at 08:10
  • "yes it is" is a bit ambiguous as an opening. – Mad Physicist Oct 22 '18 at 02:18
  • It makes sense if you're zooming in to look at the details in the interactive plot :( – endolith May 10 '23 at 20:02
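
A hedged sketch of the crude binning approach described in the comments above, assuming the point cloud is an (N, 2) numpy array called `points` (the name and bin count are illustrative):

import numpy as np
import matplotlib.pyplot as plt

# Bin N points into a fixed-resolution grid and show the counts as a grayscale image
# (`points` is assumed to be an (N, 2) array of x, y coordinates).
counts, xedges, yedges = np.histogram2d(points[:, 0], points[:, 1], bins=1000)
plt.imshow(counts.T, origin='lower', cmap='gray',
           extent=[xedges[0], xedges[-1], yedges[0], yedges[-1]])
plt.show()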

I had the same problem with more than 300k 2D coordinates from a dimensionality reduction algorithm. The solution was to approximate the coordinates into a 2D numpy array and visualize it as an image. The result was pretty good and also much faster:

import numpy as np

def plot_to_buf(data, height=2800, width=2800, inc=0.3):
    # Accumulate (x, y) points into a fixed-size image buffer.
    xlims = (data[:, 0].min(), data[:, 0].max())
    ylims = (data[:, 1].min(), data[:, 1].max())
    dxl = xlims[1] - xlims[0]
    dyl = ylims[1] - ylims[0]

    print('xlims: (%f, %f)' % xlims)
    print('ylims: (%f, %f)' % ylims)

    buffer = np.zeros((height + 1, width + 1))
    for i, p in enumerate(data):
        print('\rloading: %03d' % (float(i) / data.shape[0] * 100), end=' ')
        # Map the point into pixel coordinates (y axis flipped for image orientation).
        x0 = int(round(((p[0] - xlims[0]) / dxl) * width))
        y0 = int(round((1 - (p[1] - ylims[0]) / dyl) * height))
        buffer[y0, x0] += inc
        if buffer[y0, x0] > 1.0:
            buffer[y0, x0] = 1.0
    return xlims, ylims, buffer

import matplotlib.pyplot as plt

data = load_data() # data.shape = (310216, 2) <<< your data here
xlims, ylims, I = plot_to_buf(data, height=2800, width=2800, inc=0.3)
ax_extent = list(xlims) + list(ylims)
plt.imshow(I,
           vmin=0,
           vmax=1,
           cmap=plt.get_cmap('hot'),
           interpolation='lanczos',
           aspect='auto',
           extent=ax_extent
           )
plt.grid(alpha=0.2)
plt.title('Latent space')
plt.colorbar()
plt.show()

Here is the result:

I hope this helps you.

Dmitry
  • This can be easily vectorized: `x0 = np.round((data[:, 0] - xlims[0]) / dxl * width).astype(np.int32)` and `y0 = np.round((1 - (data[:, 1] - ylims[0]) / dyl) * height).astype(np.int32)` (a fuller vectorized sketch follows this comment). – Dylan Madisetti May 05 '23 at 16:08
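
For reference, a sketch of a fully vectorized variant of `plot_to_buf` along the lines of the comment above (my own adaptation, not part of the original answer); note that `np.add.at` is used so that repeated pixel indices accumulate instead of being written only once:

import numpy as np

def plot_to_buf_vectorized(data, height=2800, width=2800, inc=0.3):
    # Same mapping as plot_to_buf above, but without the Python loop.
    xlims = (data[:, 0].min(), data[:, 0].max())
    ylims = (data[:, 1].min(), data[:, 1].max())
    dxl = xlims[1] - xlims[0]
    dyl = ylims[1] - ylims[0]

    x0 = np.round((data[:, 0] - xlims[0]) / dxl * width).astype(np.int32)
    y0 = np.round((1 - (data[:, 1] - ylims[0]) / dyl) * height).astype(np.int32)

    buf = np.zeros((height + 1, width + 1))
    np.add.at(buf, (y0, x0), inc)  # unbuffered add handles duplicate indices correctly
    return xlims, ylims, np.clip(buf, 0.0, 1.0)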