I am creating one figure with around one hundred subplots/axes, each with a few thousand data points. Currently, I am looping through each subplot and using plt.scatter to place the points. However, this is quite slow. Is it possible to use multiple CPUs to speed up the plotting, either by assigning one core per subplot or by splitting up the plotting of the data points within a single subplot?

So far, I have attempted using joblib to use parallel processes for the subplot creation, but rather than creating new subplots within the same figure, it spawns a new figure for each subplot. I have tried with the backends PDF, Qt5Agg, and Agg. Here is a simplified example of my code.

import matplotlib as mpl
mpl.use('PDF')
import seaborn as sns
import matplotlib.pyplot as plt
from joblib import Parallel, delayed

def plotter(name, df, ax):
    ax.scatter(df['petal_length'], df['sepal_length'])

iris = sns.load_dataset('iris')
fig, axes = plt.subplots(3,1)

Parallel(n_jobs=2)(delayed(plotter)
    (species_name, species_df, ax)
    for (species_name, species_df), ax in zip(iris.groupby('species'), axes.ravel()))

fig.savefig('test.pdf')

Setting n_jobs=1 works: all points are then plotted within the same figure. However, increasing it above one creates four figures: the one that I initiate with plt.subplots, and then one for each time ax.scatter is called.

Since I am passing the axes from the first figure to plotter, I am not sure how/why the additional figures are created. Is there some fallback in matplotlib, that causes new figures to be created automatically if the specified figure is "locked" by another plotting process?

Any advice on how to improve my current approach, or on how to achieve the speedups through alternative approaches, is appreciated.

joelostblom

1 Answer

Joblib's `Parallel` uses the multiprocessing module to spawn worker processes, so each job runs in a different process. That is why you get a new figure for each job. Unlike threads, processes do not share memory, so the workers cannot access the original figure; each one draws on its own copy.
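The copying can be imitated in a single process. The following is a minimal sketch, assuming that the arguments sent to a worker are serialized with `pickle` (the helper name `draw_on_pickled_copy` is made up for this illustration). It builds the figure without pyplot so that nothing is registered globally.

```python
import pickle

from matplotlib.backends.backend_agg import FigureCanvasAgg
from matplotlib.figure import Figure


def draw_on_pickled_copy():
    """Round-trip an axes through pickle and draw only on the copy."""
    fig = Figure()
    FigureCanvasAgg(fig)  # attach a non-interactive canvas
    ax = fig.add_subplot(1, 1, 1)

    # Roughly what happens to an argument handed to a worker process:
    # the axes is serialized together with the figure it belongs to.
    ax_copy = pickle.loads(pickle.dumps(ax))
    ax_copy.scatter([1, 2, 3], [4, 5, 6])
    return ax, ax_copy


ax, ax_copy = draw_on_pickled_copy()
print(ax_copy.figure is ax.figure)  # do both axes share one figure?
print(len(ax.collections), len(ax_copy.collections))  # scatter artists on each
```

The copy should come back attached to its own figure, and the scatter should show up only on the copy, which matches the extra figures seen in the question.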

You could probably try using threads, but it is questionable if you'll get any speed gains, because of the global interpreter lock (GIL).

To speed up the plotting, you could try avoiding pyplot. It adds some overhead and, in interactive mode, redraws the plot after each plotting command. This is mostly geared toward making, for example, IPython feel more like MATLAB, but it is bad for speed. If you use matplotlib's object-oriented API directly, you can choose to draw the plot only once, when it is finished, which will probably save considerable time.
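A pyplot-free version could look roughly like the sketch below. It assumes the Agg canvas is acceptable and uses random data in place of the iris dataset; `save_scatter_grid` is a hypothetical helper name, not part of matplotlib. Whether this is noticeably faster depends on the setup, so it is worth timing against the pyplot version.

```python
import numpy as np
from matplotlib.backends.backend_agg import FigureCanvasAgg
from matplotlib.figure import Figure


def save_scatter_grid(groups, path):
    """Draw one scatter subplot per (x, y) pair and render once on save."""
    fig = Figure(figsize=(4, 2 * len(groups)))
    FigureCanvasAgg(fig)  # non-interactive canvas, no pyplot state involved
    for i, (x, y) in enumerate(groups):
        ax = fig.add_subplot(len(groups), 1, i + 1)
        ax.scatter(x, y, s=4)
    fig.savefig(path)  # the only point at which the figure is rendered
    return fig


# Stand-in data: three groups of a few thousand points each.
rng = np.random.default_rng(0)
groups = []
for _ in range(3):
    groups.append((rng.normal(size=2000), rng.normal(size=2000)))
```

Calling `save_scatter_grid(groups, 'test.pdf')` would then write all three subplots into a single file.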

Note: @Faultier mentioned in a comment that you can enable and disable interactive plotting with pyplot.ion() and pyplot.ioff().

J. P. Petersen
  • It might be most practicable to create separate figures, save them temporarily and finally load them into a combined figure? For speed gain `plt.ioff()` also helps, as the auto redraw is avoided. – Faultier Jan 03 '17 at 10:40
  • @Faultier @J.P.Petersen Thanks! I am already using `plt.ioff` (not included in the example in the question, sorry). I never display the figure, I just create it and save it as a pdf. Would I still have significant speed gains from using `matplotlib` directly and avoiding `pyplot` altogether? – joelostblom Jan 03 '17 at 14:14
  • @Faultier What do you mean by creating separate figures and combining them? From what [I have read](http://stackoverflow.com/questions/6309472/matplotlib-can-i-create-axessubplot-objects-then-add-them-to-a-figure-instance?noredirect=1&lq=1), it is cumbersome (if at all possible) and not officially supported to create matplotlib axes separately and combine them in a figure. Are you referring to saving separate PDFs and then stitching them together? I am considering this, but I am not sure which is the best cross-platform (unfortunately necessary for me) python library to implement the pdf stitching. – joelostblom Jan 03 '17 at 14:19
  • @J.P.Petersen I am still not 100% clear on why the different figures are created. I understand that the jobs are running in different processes, and that this means that if something is created in one process, it would not be accessible from the other simultaneously running processes. However, I am creating the figure before spawning the processes, and passing an existing axis to each process. Shouldn't they be able to access this specific axis which was created before the process was spawned? – joelostblom Jan 03 '17 at 14:23
  • I think figures are created because axes cannot exist without a figure (as you stated). Each process will therefore also copy the figure with the axis, yielding you many figures (this is what I think, not know). You could create separate figures and save them as a png or whatever you can reuse with `imshow()`; stitching pictures on a `gridspec()` or `subplots()` should work and then save it to pdf. Not the most pretty solution but hey, whatever works (; – Faultier Jan 03 '17 at 14:44
  • @cheflo Each child process will have a copy of the figure, the axes, and all the other variables as they were just before the spawn, but any modification made to that memory afterwards happens only in the child process. This is normally called copy-on-write (https://en.wikipedia.org/wiki/Copy-on-write). You could try to return the axes from the `plotter()` function to the parent process, but I doubt that it will work. Matplotlib states that an axes can only belong to one figure. – J. P. Petersen Jan 03 '17 at 15:17
  • @J.P.Petersen Ah ok, I see. I didn't know about the concept of CopyOnWrite, thanks for linking that. I did already try returning the axis, but as you mentioned, `matplotlib` does not provide a robust solution for "stitching" axes together in a new figure. – joelostblom Jan 03 '17 at 15:24
  • @Faultier I am not sure what you mean by "stitching pictures on a `gridspec()` or `subplots()`". It seems like [matplotlib figures can share axes](http://stackoverflow.com/questions/6309472/matplotlib-can-i-create-axessubplot-objects-then-add-them-to-a-figure-instance?noredirect=1&lq=1), so this would not be possible once loaded with `imshow()`. – joelostblom Jan 03 '17 at 15:57
  • 1. Create each of your subplots and save them as a png (this can be parallelized). 2a. Create an empty figure with the layout you initially wanted to have. 3a. Fill that figure using `plt.imread()` and `plt.imshow()`. 2/3b. Alternatively, put the pictures into a LaTeX table and create a PDF from there. 4. Redo step 1 till you are OK with how it looks (this is very likely painful). – Faultier Jan 03 '17 at 16:29
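Steps 1 to 3a from the last comment can be sketched as follows. This is only an illustration under some assumptions: it uses the standard-library `ProcessPoolExecutor` instead of joblib, the function names (`render_panel`, `render_all`, `stitch`) are made up for this example, and the combined figure contains rasterized images rather than vector graphics.

```python
from concurrent.futures import ProcessPoolExecutor

from matplotlib.backends.backend_agg import FigureCanvasAgg
from matplotlib.figure import Figure
from matplotlib.image import imread


def render_panel(job):
    """Step 1: draw one scatter panel in its own figure and save it as a PNG."""
    x, y, png_path = job
    fig = Figure(figsize=(3, 2), dpi=100)
    FigureCanvasAgg(fig)  # non-interactive canvas, safe inside a worker
    ax = fig.add_subplot(1, 1, 1)
    ax.scatter(x, y, s=4)
    fig.savefig(png_path)
    return png_path


def render_all(jobs, n_jobs=2):
    """Run render_panel for every job in separate worker processes."""
    with ProcessPoolExecutor(max_workers=n_jobs) as pool:
        return list(pool.map(render_panel, jobs))


def stitch(png_paths, out_path, ncols=2):
    """Steps 2a and 3a: place the saved PNGs on a grid in one figure."""
    nrows = -(-len(png_paths) // ncols)  # ceiling division
    fig = Figure(figsize=(3 * ncols, 2 * nrows), dpi=100)
    FigureCanvasAgg(fig)
    for i, path in enumerate(png_paths):
        ax = fig.add_subplot(nrows, ncols, i + 1)
        ax.imshow(imread(path))
        ax.axis("off")  # hide the outer axes; the image carries its own
    fig.savefig(out_path)
    return fig
```

Each job is an `(x, y, png_path)` tuple. `render_all(jobs)` should be called from under an `if __name__ == "__main__":` guard so that worker processes can import the module safely, and `stitch(paths, 'combined.pdf')` then builds the combined figure in the parent process.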