17

I've read here that matplotlib is good at handling large data sets. I'm writing a data processing application and have embedded matplotlib plots into wx and have found matplotlib to be TERRIBLE at handling large amounts of data, both in terms of speed and in terms of memory. Does anyone know a way to speed up (reduce memory footprint of) matplotlib other than downsampling your inputs?

To illustrate how bad matplotlib is with memory consider this code:

import pylab
import numpy
a = numpy.arange(int(1e7)) # only 10,000,000 32-bit integers (~40 Mb in memory)
# watch your system memory now...
pylab.plot(a) # this uses over 230 ADDITIONAL Mb of memory
David Morton
  • I've always downsampled. Why would you ever need to try to render 10M points on a graph? – Paul Feb 12 '11 at 04:34
  • matplotlib is slow; it is a known fact. For Qt I use the guiqwt package; maybe there is something like it for wx too. – tillsten Feb 12 '11 at 15:59
  • @Paul I just wanted to make it easy for my users to explore the data graphically, i.e. when they zoom, I didn't want to have to resample again depending on their zoom bounds; they would see the actual data no matter how they zoomed/panned. – David Morton Feb 12 '11 at 18:53
  • If it's feasible, try not plotting things with lines connecting them... `plt.plot(a, 'b.')` will be _much_ faster than the default `plt.plot(a, 'b-')`. – Joe Kington Feb 12 '11 at 20:23
  • Try turning anti-aliasing off. – Paul Feb 12 '11 at 22:49
  • I can understand wanting to zoom the full extent of the data. I had the same problem with huge datasets. I ended up adding a button that let the user trigger the resampling at their current zoom level. Today I would probably look into automating the trigger by having wx read the zoom level (a sketch of that idea follows these comments). If you end up plotting points instead of lines as Joe suggests, you may be able to get away with adding new, finer-sampled collections over the old (same color, of course). – Paul Feb 12 '11 at 23:04
  • @Joe Kington My tests do not show dots to be faster or less memory-intensive than lines. :( – David Morton Feb 13 '11 at 00:34
  • @David - Hmm... You're quite right... In fact, using dots seems to be less responsive... I remembered quite the opposite, but perhaps that was only true for some earlier version of matplotlib. At any rate, matplotlib deliberately keeps multiple (transformed) copies of the original data around, so if you need something more memory-efficient, I'll second looking into `guiqwt` (it's qt-based, though). It's less flexible than matplotlib, but much more lightweight, and still _very_ slick. – Joe Kington Feb 13 '11 at 16:49
  • "Terrible" at handling large amounts of data, compared to what? Another plotting package that downsamples automatically? – Pete May 02 '11 at 14:38
  • I am working with a set of many (e.g. 4000) small line segments drawn over an image, and it gets very, very slow. I don't want to downsample: I'm not just presenting some statistical summary, I want to look at these lines. I really wish it were faster. I will probably have to move to some kind of Qt+OpenGL solution. – dividebyzero Apr 11 '12 at 18:15
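
A minimal sketch of the zoom-triggered resampling Paul describes in the comments above, using matplotlib's documented 'xlim_changed' Axes callback; MAX_POINTS, on_xlim_changed and the synthetic x/y arrays are illustrative names and data, not from the thread:

import numpy as np
import matplotlib.pyplot as plt

MAX_POINTS = 10000                      # hypothetical per-view point budget
x = np.arange(int(1e7))                 # synthetic stand-in for the real data
y = np.random.randn(int(1e7)).cumsum()

fig, ax = plt.subplots()
# Anti-aliasing off, as Paul suggests above; start with a coarse view.
line, = ax.plot(x[::len(x) // MAX_POINTS], y[::len(y) // MAX_POINTS],
                antialiased=False)

def on_xlim_changed(axes):
    # Re-decimate only the slice of data inside the current view.
    lo, hi = axes.get_xlim()
    i0, i1 = np.searchsorted(x, [lo, hi])
    step = max((i1 - i0) // MAX_POINTS, 1)
    line.set_data(x[i0:i1:step], y[i0:i1:step])
    axes.figure.canvas.draw_idle()

ax.callbacks.connect('xlim_changed', on_xlim_changed)
plt.show()

The same callback registry also accepts 'ylim_changed' if vertical zooming should trigger a refresh too.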

3 Answers

7

Downsampling is a good solution here: plotting 10M points consumes a lot of memory and time in matplotlib. If you know how much memory is acceptable, you can downsample based on that budget. For example, say 1M points takes 23 additional MB of memory and that is acceptable to you in terms of space and time; then downsample so you always stay below 1M points:

import scipy.signal

max_points = int(1e6)
if len(a) > max_points:
    a = scipy.signal.decimate(a, len(a) // max_points + 1)
pylab.plot(a)

Or something along those lines (the snippet above may downsample more aggressively than you'd like).

brandx
  • A simple decimation is inadequate, and is what Matplotlib does internally as far as I can tell. The reason I don't simply want to decimate is that you lose the extreme values in each decimation interval. If the signal had a sharp spike within an interval, you wouldn't see it on the plot at all unless you were very lucky with the intervals. I wrote some code that does this more intelligently, taking the extreme values for each decimation interval instead of the value at the center of the interval (or edge); a sketch of that idea follows these comments. I'm accepting your answer though, since this is in principle what I did. – David Morton Jul 06 '11 at 18:09
  • David - if you solved this 'more intelligently', would you mind sharing? You can mark your own answers as 'solved' and may get a few upvotes... – danodonovan Sep 25 '12 at 15:22
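
David's min/max code was never posted in the thread, but a rough sketch of the idea he describes (keeping both extremes of every decimation interval so short spikes stay visible) could look like the following; minmax_decimate is an illustrative name, not his actual implementation:

import numpy as np

def minmax_decimate(y, max_points=1000):
    # Keep both the min and the max of every decimation interval so that
    # short spikes remain visible; illustrative sketch, not the original code.
    y = np.asarray(y)
    if len(y) <= max_points:
        return y
    interval = int(np.ceil(2 * len(y) / max_points))
    pad = -len(y) % interval
    padded = np.pad(y, (0, pad), mode='edge')
    blocks = padded.reshape(-1, interval)
    out = np.empty(2 * blocks.shape[0], dtype=padded.dtype)
    out[0::2] = blocks.min(axis=1)   # interval minima
    out[1::2] = blocks.max(axis=1)   # interval maxima
    return out

Each block's minimum and maximum are emitted in a fixed order rather than their original order within the block, which is usually acceptable for an on-screen envelope; the result can be passed straight to pylab.plot.
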
2

I'm often interested in the extreme values too, so before plotting large chunks of data I proceed this way:

import numpy as np

s = np.random.normal(size=int(1e7))  # size must be an integer
decimation_factor = 10
# Reduce each block of `decimation_factor` consecutive samples to its maximum
s = np.max(s.reshape(-1, decimation_factor), axis=1)

# To check the final size
s.shape

Of course, np.max is just one example of a function for extracting the extreme value of each block.

P.S. With numpy "stride tricks" it should be possible to avoid copying data around during the reshape.
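
For reference, a sketch of that stride-tricks idea using numpy.lib.stride_tricks.as_strided; note that reshaping a contiguous 1-D array already returns a view, so the explicit strided version below is mostly illustrative:

import numpy as np
from numpy.lib.stride_tricks import as_strided

s = np.random.normal(size=int(1e7))
decimation_factor = 10
n_blocks = s.size // decimation_factor

# View the contiguous 1-D array as (n_blocks, decimation_factor) blocks
# without copying, then reduce each block.
blocks = as_strided(s,
                    shape=(n_blocks, decimation_factor),
                    strides=(decimation_factor * s.strides[0], s.strides[0]))
envelope = blocks.max(axis=1)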

Eraldo P.
2

I was interested in preserving one side of a log-sampled plot, so I came up with the following (downsample was my first attempt):

import numpy as np

def downsample(x, y, target_length=1000, preserve_ends=0):
    """Uniformly decimate (x, y), optionally keeping the first and
    last `preserve_ends` points untouched."""
    assert len(x.shape) == 1
    assert len(y.shape) == 1
    data = np.vstack((x, y))
    if preserve_ends > 0:
        l, data, r = np.split(data, (preserve_ends, -preserve_ends), axis=1)
    interval = int(data.shape[1] / target_length) + 1
    data = data[:, ::interval]
    if preserve_ends > 0:
        data = np.concatenate([l, data, r], axis=1)
    return data[0, :], data[1, :]

def geom_ind(stop, num=50):
    # Build roughly `num` unique, geometrically spaced integer indices in [0, stop).
    geo_num = num
    ind = np.geomspace(1, stop - 1, dtype=int, num=geo_num)
    while len(set(ind)) < num - 1:
        geo_num += 1
        ind = np.geomspace(1, stop - 1, dtype=int, num=geo_num)
    return np.sort(list(set(ind) | {0}))

def log_downsample(x, y, target_length=1000, flip=False):
    assert len(x.shape) == 1
    assert len(y.shape) == 1
    data = np.vstack((x, y))
    if flip:
        data = np.fliplr(data)
    data = data[:, geom_ind(data.shape[1], num=target_length)]
    if flip:
        data = np.fliplr(data)
    return data[0, :], data[1, :]

which allowed me to better preserve one side of the plot:

import matplotlib.pyplot as plt

# x, y are the original (large) data arrays
newx, newy = downsample(x, y, target_length=1000, preserve_ends=50)
newlogx, newlogy = log_downsample(x, y, target_length=1000)
f = plt.figure()
plt.gca().set_yscale("log")
plt.step(x, y, label="original")
plt.step(newx, newy, label="downsample")
plt.step(newlogx, newlogy, label="log_downsample")
plt.legend()

(plot comparing the original data with the downsample and log_downsample results)

Marvin Thielk