
I have a library of large images (8000x6000 px, ~13 MB each) for which I would like to generate multiple thumbnails of smaller sizes, with widths of 3000px, 2000px, 1000px, 500px, 250px, and 100px.

The source image is stored in a flat file, and the generated thumbnails will also be stored in flat files.

I've been thinking about the optimal way to do this in Python, and these are potential issues that immediately come to mind:

  • Would it make sense to generate each thumbnail from the source image, or can I create the smaller thumbnails from any thumbnail that is slightly larger? E.g., 8000px -> 3000px, 3000px -> 2000px, 1000px -> 500px, etc. Wouldn't that run much faster? (A rough sketch of what I mean follows this list.)
  • Does it make sense to load the source image into memory before generating the thumbnails?
  • Should I use ImageMagick? From the command line, or via API?
  • Any way to tap into the GPU?
  • Would multiple threads make sense in this case?
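
To make the first bullet concrete, here is a rough sketch of the cascading approach I have in mind (using Pillow purely as an illustration - which library and resampling filter to use is part of the question):

from PIL import Image

WIDTHS = [3000, 2000, 1000, 500, 250, 100]

def cascade(path):
    # Resize each thumbnail from the previous, slightly larger result
    # instead of going back to the 8000px source every time.
    current = Image.open(path)
    for w in WIDTHS:
        h = round(current.height * w / current.width)
        current = current.resize((w, h), Image.LANCZOS)
        current.save("{}_{}.jpg".format(path, w), quality=85)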

Are there other things to keep in mind when optimizing thumbnail generation? Sample code to get me started would be greatly appreciated. Thank you.

ensnare
  • What do you mean by *"the image is stored in a flat file"*? Do you mean a `JPEG`, or `TIFF` or somesuch? And what would be a bumpy file? – Mark Setchell Aug 22 '15 at 09:50
  • Late to answer this, but 'flat file' usually refers to an OS file-system file - as opposed to being stored in structured storage like a database etc. – Webreaper Jun 20 '18 at 15:49

1 Answer


I made some images and ran some tests so you can see the effect of the various techniques on performance.

I made the images to contain random, difficult-to-compress data, with dimensions and file sizes to match yours, i.e.

convert -size 8000x6000 xc:gray +noise random -quality 35 image.jpg

then, ls gives 13MB like this:

-rw-r--r--  1 mark  staff    13M 23 Aug 17:55 image.jpg

I made 128 such random images because that is nicely divisible by the 8 CPU cores on my machine - see parallel tests later.

Now for the methods...

Method 1

This is the naive method - each thumbnail is generated by reading the full-size source image again, one file after another.

#!/bin/bash
for f in image*jpg; do
   for w in 3000 2000 1000 500 250 100; do
      convert $f -resize ${w}x res_${f}_${w}.jpg
   done 
done

Time: 26 mins 46 secs

Method 2

Here we read each input image only once, but generate all the output sizes from that single read, and it is considerably faster. Note that each -resize in the chain operates on the result of the previous one, so the thumbnails are produced as a cascade from largest to smallest - essentially the approach you asked about in your first bullet point.

#!/bin/bash
for f in image*jpg; do
   convert $f -resize 3000x -write res_${f}_3000.jpg \
              -resize 2000x -write res_${f}_2000.jpg \
              -resize 1000x -write res_${f}_1000.jpg \
              -resize 500x  -write res_${f}_500.jpg  \
              -resize 250x  -write res_${f}_250.jpg  \
              -resize 100x  res_${f}_100.jpg
done

Time: 6 min 17 secs

Method 3

Here we advise ImageMagick up-front that the largest image we are going to need is only 3000x2250 pixels, so it can use less memory, have the JPEG library decode at a reduced scale (reading fewer DCT coefficients) and do less I/O. This is called "shrink-on-load".

#!/bin/bash
for f in image*jpg; do
   convert -define jpeg:size=3000x2250 $f            \
              -resize 3000x -write res_${f}_3000.jpg \
              -resize 2000x -write res_${f}_2000.jpg \
              -resize 1000x -write res_${f}_1000.jpg \
              -resize 500x  -write res_${f}_500.jpg  \
              -resize 250x  -write res_${f}_250.jpg  \
              -resize 100x  res_${f}_100.jpg
done

Time: 3 mins 37 secs

Just as an aside, to demonstrate the reduced time, I/O and memory needed when you tell ImageMagick up-front how big an image you are going to need, compare these two commands, both reading one of your 8000x6000, 13MB images and both generating the same thumbnail:

/usr/bin/time -l convert image.jpg -resize 500x result.jpg 2>&1 | egrep "resident|real"        
1.92 real         1.77 user         0.14 sys
415727616  maximum resident set size

i.e. 415 MB and 2 seconds

/usr/bin/time -l convert -define jpeg:size=500x500 image.jpg -resize 500x result.jpg 2>&1 | egrep "resident|real"

0.24 real         0.23 user         0.01 sys
23592960  maximum resident set size

i.e. 23 MB and 0.2 seconds - and the output image has the same contents and quality.
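
If you end up driving this from Python with Pillow rather than from the shell, the closest analogue to -define jpeg:size=... that I know of is Image.draft(), which asks the JPEG decoder to decode at a reduced scale (1/2, 1/4 or 1/8), so you still do a final resize to the exact target afterwards. A minimal sketch, assuming Pillow is available; the function name is my own:

from PIL import Image

def open_shrunk(path, bound=(3000, 2250)):
    # Hint to the JPEG decoder that we never need more than 'bound' pixels,
    # roughly analogous to ImageMagick's -define jpeg:size=3000x2250.
    img = Image.open(path)
    img.draft("RGB", bound)   # JPEG only; decodes at 1/2, 1/4 or 1/8 scale as appropriate
    return img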

Method 4

Here we go all-out and use GNU Parallel, as well as all the foregoing techniques, to drive your CPUs, fans and power consumption crazy!!!

#!/bin/bash
for f in image*jpg; do
   cat<<EOF
convert -define jpeg:size=3000x2250 $f          \
              -resize 3000x -write res_${f}_3000.jpg \
              -resize 2000x -write res_${f}_2000.jpg \
              -resize 1000x -write res_${f}_1000.jpg \
              -resize 500x  -write res_${f}_500.jpg  \
              -resize 250x  -write res_${f}_250.jpg  \
              -resize 100x  res_${f}_100.jpg
EOF
done | parallel

Time: 56 seconds

In summary, we can reduce the processing time from 27 minutes to 56 seconds by reading each input image only once and writing as many outputs per read as possible, by telling ImageMagick up-front how much of the input image it actually needs, and by using GNU Parallel to keep all your lovely CPU cores busy. HTH.
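
Since you asked about Python specifically: the same ideas - one read per source image, shrink-on-load, a cascade of resizes from largest to smallest, and one worker per CPU core - map fairly directly onto Pillow plus multiprocessing. This is a rough sketch rather than a tuned solution; the output naming, JPEG quality and resampling filter are assumptions you would adjust:

import multiprocessing
from pathlib import Path
from PIL import Image

WIDTHS = [3000, 2000, 1000, 500, 250, 100]

def make_thumbnails(path):
    img = Image.open(path)
    img.draft("RGB", (3000, 2250))        # shrink-on-load: never decode more than the largest thumbnail needs
    current = img.convert("RGB")
    for w in WIDTHS:                      # cascade from largest to smallest, like the -resize chain above
        h = round(current.height * w / current.width)
        current = current.resize((w, h), Image.LANCZOS)
        current.save("res_{}_{}.jpg".format(Path(path).stem, w), quality=85)
    return path

if __name__ == "__main__":
    files = [str(p) for p in sorted(Path(".").glob("image*.jpg"))]
    with multiprocessing.Pool() as pool:  # defaults to one worker per CPU core, much like GNU Parallel above
        pool.map(make_thumbnails, files)

One caveat: producing each size from the previous one is fast, but the smallest thumbnails can come out slightly softer than if you resized everything from the 3000px version, so it is worth checking the 100px output on your own images.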

Mark Setchell