
For a deep learning project, I need to synthesize plots for each item in my dataset. This means generating 2.5 million plots, each 224x224 pixels.

So far the best I've been able to do is this, which takes 2.7 seconds to run on my PC:

from matplotlib.backends.backend_agg import FigureCanvasAgg as FigureCanvas
import matplotlib.pyplot as plt

for i in range(100):
    fig = plt.Figure(frameon=False, facecolor="white", figsize=(4, 4))
    ax = fig.add_subplot(111)
    ax.axis('off')
    ax.plot([1, 2, 3, 4, 5, 6, 7, 8], [2, 4, 6, 8, 8, 6, 4, 3])
    canvas = FigureCanvas(fig)
    canvas.print_figure(str(i), dpi=56)  # 4 in x 56 dpi = 224x224 px

A resulting image (from this reproducible example) looks like this:

[example output: a single black polyline on a white 224x224 canvas]

The real images use a bit more data (200 rows) but that makes little difference to speed.

At the speed above it will take me around 18 hours to generate all my plots! Are there any clever ways to speed this up?

Alex Lach
    Could you show the result image for your example? It's not quite obvious how 2x8 values can yield 224x224 px. – AKX Aug 08 '22 at 19:49
  • I've added an example plot from the reproducible example code. The real data is a bit bigger (200 points) but that doesn't seem to affect the time much. – Alex Lach Aug 08 '22 at 19:57
  • 3
    Right - I think using [Pillow's `ImageDraw.line()`](https://pillow.readthedocs.io/en/stable/reference/ImageDraw.html) to draw polylines would probably be much faster than all of the magic MPL has to do. – AKX Aug 08 '22 at 20:01
  • I'll give that a try and report back! – Alex Lach Aug 08 '22 at 20:18
  • If you are really writing that many files to disk, on a regular basis, you will almost certainly benefit from 1) multiprocessing and 2) an NVME disk to sustain fast i/o. – Mark Setchell Aug 08 '22 at 20:51
  • The advice from @AKX to use Pillow made things about 6x faster. I've added an answer based off that, but won't accept for now in case anybody else posts. – Alex Lach Aug 08 '22 at 20:52
  • I'll try multithreading next. I'll have to do some reading up on the best options for that. – Alex Lach Aug 08 '22 at 20:52
  • You might have a look here for some related ideas https://stackoverflow.com/a/51822265/2836621 – Mark Setchell Aug 08 '22 at 20:54
  • After you have 2,500,000 plots, then what? Are you going to look at all of them? Stitch them into a video? At 60 frames/sec, that'd still be a 11.6-hr long video – Paul H Aug 08 '22 at 23:35

1 Answer


Per the comment from AKX, Pillow's ImageDraw.line() function draws a polyline much faster than Matplotlib for this task:

from PIL import Image, ImageDraw
from itertools import chain

scale = 224
pad = 5
scale_pad = scale - pad * 2
for i in range(200):
    im = Image.new('RGB', (scale, scale), (255, 255, 255))
    draw = ImageDraw.Draw(im)
    x = [1, 2, 3, 4, 5, 6, 7, 8]
    y = [2, 4, 6, 8, 8, 6, 4, 3]
    # Normalize each series into [pad, scale - pad].
    x = [pad + (v - min(x)) / (max(x) - min(x)) * scale_pad for v in x]
    y = [pad + (v - min(y)) / (max(y) - min(y)) * scale_pad for v in y]
    # Flatten to [x0, y0, x1, y1, ...] as ImageDraw.line() accepts.
    # Note: PIL's y-axis points down, so the line is vertically flipped
    # relative to the Matplotlib version.
    draw.line(list(chain.from_iterable(zip(x, y))), fill=(0, 0, 0), width=4)
    im.save(f"{i}.png")

This performs about 6x faster than Matplotlib, meaning my task should take only ~3 hours instead of 18.

Alex Lach
    You *might* find it faster to empty your existing image ( e.g. by filling with black) than to create a new one and a new drawing context at each iteration. – Mark Setchell Aug 08 '22 at 20:57
    You *may* find it faster to save as JPEG than as PNG, as long as it doesn't upset your downstream processing. – Mark Setchell Aug 08 '22 at 20:58
  • 2
    You can use `Image.paste(0, box=(0,0,w,h))` to fill with black, by the way. Also, if your image has fewer than 256 colours (seems to be black and white), you could try creating it in palette mode, i.e. `P` rather than `RGB`. They will likely take less space in RAM and on disk, which may become important if multiprocessing. Again, as long as it doesn't affect downstream processing. – Mark Setchell Aug 08 '22 at 21:27
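The two suggestions above can be combined in a sketch like the following: it reuses one canvas across iterations via `Image.paste()`, and uses mode "L" (8-bit grayscale, one byte per pixel) as a palette-free alternative to "P" — whether either change helps in practice is worth benchmarking:

```python
from PIL import Image, ImageDraw

SCALE, PAD = 224, 5
SPAN = SCALE - 2 * PAD
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2, 4, 6, 8, 8, 6, 4, 3]
# Normalize once here, since this example reuses the same series.
xs = [PAD + (v - min(x)) / (max(x) - min(x)) * SPAN for v in x]
ys = [PAD + (v - min(y)) / (max(y) - min(y)) * SPAN for v in y]

# Mode "L" is single-channel: 0 is black, 255 is white.
# Create the image and drawing context once, outside the loop.
im = Image.new("L", (SCALE, SCALE), 255)
draw = ImageDraw.Draw(im)
for i in range(8):
    im.paste(255, box=(0, 0, SCALE, SCALE))  # blank the reused canvas
    draw.line(list(zip(xs, ys)), fill=0, width=4)
    im.save(f"reuse_{i}.png")
```

As noted in the comments, a single-channel image takes a third of the RAM of an RGB one, which matters if many worker processes hold images at once.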