11

I would like to use matplotlib to generate a number of PDF files. My main problem is that matplotlib is slow, taking on the order of 0.5 seconds per file.

I tried to figure out why it takes so long, and I wrote the following test program that just plots a very simple curve as a PDF file:

import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt

X = range(10)
Y = [ x**2 for x in X ]

for n in range(100):
    fig = plt.figure(figsize=(6,6))
    ax = fig.add_subplot(111)
    ax.plot(X, Y)
    fig.savefig("test.pdf")

But even something as simple as this takes a lot of time: 15–20 seconds in total for 100 PDF files (on modern Intel hardware; I tried both Mac OS X and Linux systems).

Are there any tricks and techniques that I can use to speed up PDF generation in matplotlib? Obviously I can use multiple parallel threads on multi-core platforms, but is there anything else that I can do?
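For reference, a quick way to see where the time is going is to run the rendering loop under cProfile and sort by cumulative time (a minimal sketch; the iteration count of 5 and the function name `render` are arbitrary, and the loop mirrors the test program above):

```python
import cProfile
import io
import pstats

import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt

X = range(10)
Y = [x**2 for x in X]

def render(n_files=5):
    # Same work as the test program above, just fewer iterations
    for n in range(n_files):
        fig = plt.figure(figsize=(6, 6))
        ax = fig.add_subplot(111)
        ax.plot(X, Y)
        fig.savefig("test.pdf")
        plt.close(fig)

profiler = cProfile.Profile()
profiler.enable()
render()
profiler.disable()

# Show the 10 functions with the highest cumulative time
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats('cumulative').print_stats(10)
print(stream.getvalue())
```

In runs like this the hot spots typically show up inside the backend's save path rather than in the data handling, which matches the observation that the input format does not matter.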

Jukka Suomela
    As you have already seen, I don't have an answer, but I do have appropriate references. There is a development ticket here: https://github.com/matplotlib/matplotlib/issues/992, and a mailing list question archived here: http://sourceforge.net/mailarchive/forum.php?thread_name=4FF926B8.1030202%40hawaii.edu&forum_name=matplotlib-users – pelson Aug 19 '12 at 16:58
  • I tried to create a PDF file (using arrays for plotting) with a simple plot, and it took me 72ns to create a file. Do you depend on the lists which you created here? If not, I could post my solution. – ahelm Aug 20 '12 at 11:32
  • @PateToni: The input format is irrelevant here, data conversion is *much* faster than plotting. :) – Jukka Suomela Aug 20 '12 at 11:35
  • @JukkaSuomela: Sorry, but I found out that my version of Python on a Windows machine is kinda broken. It doesn't show me the right timing. The 72ns aren't true. I didn't get any speed-up on my notebook. The bottleneck is inside matplotlib, based on profiling. Just try some alternatives (PyCha, ...) or search for a faster machine to work on =) – ahelm Aug 20 '12 at 13:59
  • Is this http://stackoverflow.com/questions/4690585/is-there-a-matplotlib-flowable-for-reportlab the answer you're looking for? – Drake Guan Aug 30 '12 at 03:41
  • @Drake: I am sorry, I do not see how it is related to my question? – Jukka Suomela Aug 30 '12 at 09:28
  • What is the size of the generated PDF? Aren't there more points plotted than you really need? – bokan Sep 01 '12 at 00:43
  • @bokan: In the above example, the size of the PDF file is less than 7 kilobytes, and the number of points in the plot is 10. – Jukka Suomela Sep 01 '12 at 00:48

4 Answers

4

If it's practical, you could use the multiprocessing module to do this (assuming you have multiple cores on your machine):

NOTE: The following code will produce 40 PDFs in the current directory on your machine

import multiprocessing

import matplotlib
matplotlib.use('Agg')  # non-interactive backend; select it before importing pyplot
import matplotlib.pyplot as plt


def do_plot(y_pos):
    fig = plt.figure()
    ax = fig.add_subplot(111)
    ax.axhline(y_pos)
    fig.savefig('%s.pdf' % y_pos)
    plt.close(fig)  # free the figure's memory


if __name__ == '__main__':
    pool = multiprocessing.Pool()

    for i in range(40):
        pool.apply_async(do_plot, [i])

    pool.close()
    pool.join()
    print('done')

It doesn't scale perfectly, but I get a significant boost by doing this on my 4 cores (dual-core with hyperthreading):

$> time python multi_pool_1.py 
done

real    0m5.218s
user    0m4.901s
sys 0m0.205s

$> time python multi_pool_n.py 
done

real    0m2.935s
user    0m9.022s
sys 0m0.420s

I'm sure there is a lot of scope for performance improvements in mpl's PDF backend, but that will not happen on the timescale you are after.

HTH,

pelson
  • Unfortunately this does not really answer my question. I wrote: "Obviously I can use multiple parallel threads on multi-core platforms, but is there anything else that I can do?" – Jukka Suomela Aug 19 '12 at 15:48
  • Sometimes you have to be pragmatic. Use of reportlab isn't an answer to your question "Are there any tricks and techniques that I can use to speed up PDF generation in **matplotlib**?" either, but it is a good suggestion for some cases. – pelson Aug 19 '12 at 16:14
  • Please don't get me wrong; using multicore computers (and more generally, cluster environments) is a very good way to speed up computation, especially if the computations are trivial to parallelise, as is the case here. However, this is something that I already know and something that I am doing already; my question was about other possible ways to speed up matplotlib. The number of parallel cores is, after all, still somewhat limited, and cluster environments have their overheads. And a single core running at > 2GHz *should* be able to generate more than 5 simple PDF figures per second. :) – Jukka Suomela Aug 19 '12 at 16:25
  • In the end, abusing raw computer power and parallel threads was the only approach that gave substantial speedups. In my real application, with one particular test data set, the original running time with my own computer was **63** seconds. A high-end server improved it to 29 seconds with a single CPU, and with multiprocessing it was only 6 seconds (8 cores + hyperthreading). Using multiple servers in parallel squeezed it to **4** seconds, and now the bottleneck is already in other parts of the application, completely unrelated to matplotlib. A horrible overkill, but I am happy now. :) – Jukka Suomela Sep 01 '12 at 10:54
3

Matplotlib has a lot of overhead for creating the figure, etc., even before saving it to PDF. So if your plots are similar, you can save a lot of this "setting up" by reusing elements, just as in the animation examples for matplotlib.

You can reuse the figure and axes in this example:

import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt

X = range(10)
Y = [ x**2 for x in X ]
fig = plt.figure(figsize=(6,6))
ax = fig.add_subplot(111)


for n in range(100):
    ax.clear() # or even better, just line.remove(),
               # but that may interfere with autoscaling (see below)
    line = ax.plot(X, Y)[0]
    fig.savefig("test.pdf")

Note that this does not help that much. You can save quite a bit more by reusing the lines:

line = ax.plot(X, Y)[0]
for n in range(100):
    # Now instead of plotting, we update the current line:
    line.set_xdata(X)
    line.set_ydata(Y)
    # If autoscaling is necessary:
    ax.relim()
    ax.autoscale()

    fig.savefig("test.pdf")

This is close to twice as fast as the initial example for me. This is only an option if you do similar plots, but if they are very similar, it can speed things up a lot. The matplotlib animation examples may have inspiration for this kind of optimization.
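To quantify the gain on your own machine, a small timing harness along these lines can be used (the file names and the iteration count N are arbitrary; the two loops correspond to the naive version from the question and the line-reuse version above):

```python
import time

import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt

X = range(10)
Y = [x**2 for x in X]
N = 20  # arbitrary; raise for more stable numbers

# Naive: a brand-new figure for every file
t0 = time.perf_counter()
for n in range(N):
    fig = plt.figure(figsize=(6, 6))
    ax = fig.add_subplot(111)
    ax.plot(X, Y)
    fig.savefig("naive.pdf")
    plt.close(fig)
naive = time.perf_counter() - t0

# Reuse: one figure and one line, only the data is updated
fig = plt.figure(figsize=(6, 6))
ax = fig.add_subplot(111)
line = ax.plot(X, Y)[0]
t0 = time.perf_counter()
for n in range(N):
    line.set_data(X, Y)
    ax.relim()
    ax.autoscale()
    fig.savefig("reuse.pdf")
reuse = time.perf_counter() - t0

print("naive: %.2fs  reuse: %.2fs" % (naive, reuse))
```

The absolute numbers will vary with the backend and hardware, but the ratio between the two loops shows how much of the per-file cost is figure setup rather than `savefig` itself.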

seberg
  • Thanks, these are good ideas. Unfortunately, in my real application the actual `savefig` calls seem to take much more than 50 % of the total running time. – Jukka Suomela Aug 26 '12 at 15:48
  • @JukkaSuomela, true, this shouldn't make much of a difference for savefig itself. Only idea I got would be trying another backend, but I doubt it makes a big difference... – seberg Aug 26 '12 at 16:09
  • It seems that the 'pdf' backend is as slow as 'Agg'. It would be a bit faster to use the 'ps' backend to generate PostScript files (but then I would need to convert PS to PDF). – Jukka Suomela Aug 26 '12 at 16:14
  • @JukkaSuomela try cairo.pdf maybe? The result might look a little different though I guess. – seberg Aug 26 '12 at 16:15
  • cairo.pdf seems to be slightly faster but not much. – Jukka Suomela Aug 26 '12 at 16:21
0

You could use ReportLab. The open-source version should be enough to do what you are trying to do. It should be a lot faster than using matplotlib to generate the PDFs.

BigHandsome
  • Can I easily translate my existing code that uses matplotlib to use Report Lab? Is the quality of the output on a par with matplotlib? – Jukka Suomela Aug 19 '12 at 13:21
  • The coding is going to be a little more complex. You could create a JPG and then embed it into the PDF. [2.1.3 Can I use any images?](http://www.reportlab.com/software/opensource/rl-toolkit/faq/#2.1.3). Or you could rewrite it to use ReportLab Platypus. Either should be faster, because matplotlib image generation is faster than its PDF generation. Or, one of the researchers where I work uses [pdfrw](http://code.google.com/p/pdfrw/), and swears by it. I believe the reason that matplotlib is so slow in generating PDFs is because it converts them to LaTeX and then PDF using Sphinx. – BigHandsome Aug 19 '12 at 13:37
0

I assume that changing the library (matplotlib) is not an option for you, because you actually like what matplotlib produces :-). I also assume -- and some people here have already commented on this -- that other backends for matplotlib are not significantly faster. I think in these days of many cores per machine and operating systems with good task schedulers it is just fine to run jobs like yours in parallel in order to optimize the throughput, i.e. the rate of PDF file creation. I think you'll manage to produce lots of files per second with a reasonable amount of computing power. This is the way to go, so I honestly believe that your question is very interesting, but not really relevant in practice.

Dr. Jan-Philip Gehrcke