5

I have a speed-critical program that repeatedly reads images from disk and computes values from them. There are too many images to keep in memory.

The same set of images will be read over and over; we will not change/edit them, and their order is fixed.

Not all the images have the same size, but each is roughly 1 MB when encoded as PNG. There are tens of thousands of them, and most of the RAM is already used to store the computed values.

Other than buying faster disks or using RAID, what is the fastest way to read a sequence of images?

Would it be faster to have them all inside a large tar file (and read them with custom untarring code) than as individual files in a folder?

I could not find a multi-threaded implementation of PNG decoding, so this stage may also become a bottleneck. Would using WebP instead of PNG provide an additional speed edge?

Which other ideas should I consider/evaluate?

rodrigob
  • If I had to guess, there is no faster way than reading them one by one. Thinking about it, the bottleneck would be disk access, and there is no way around it... The only thing that comes to my mind is to choose an image format that gives less overhead in the disk->cpu transfer. Decoding the image is probably a lot faster than reading the file from disk. – Ian Medeiros May 17 '13 at 13:36
  • Have you tried any alternatives and have some profiling results? Using SSD disks is an option? – Roger Rowland May 17 '13 at 13:36
  • Perhaps you should read them once, compute whatever values/information you need from them, and store that information, so you don't ever have to read all the images again (at least not until either the images or the calculated information you need from them changes). – twalberg May 17 '13 at 14:10
  • @twalberg that is certainly an option, but the exact data I compute changes regularly, and it turns out to be _larger_ than the original images. Writing and reading all that data would be slower than just recomputing it on the fly. – rodrigob May 17 '13 at 14:21
  • @IanMedeiros: packing the images into a single file might be faster, because antivirus software, if present, might scan every file you open. If you open one file, there'll be one check. If you open many files, there'll be multiple checks. – SigTerm May 17 '13 at 14:27
  • @SigTerm worry not, there is no antivirus running on my workstations ;p. For this question, there are only two applications running on the machine: my speed-sensitive program and the OS; nothing else. – rodrigob May 17 '13 at 14:29
  • @RogerRowland I do have access to SSDs, but not all machines on the cluster have them. I am asking this question to know which ideas to evaluate. I will report my results once done. – rodrigob May 17 '13 at 14:33
  • @rodrigob Ah... the way I read your question, I was imagining the "value" you computed was some sort of summary statistic, not a transform of the image or something on that scale... – twalberg May 17 '13 at 14:40
  • If you have a cluster, why don't you parallelize it, e.g. process image# `n` on machine `n mod k` (where `k` is the number of machines)? – smocking May 17 '13 at 15:14
  • @smocking that is already done; after spreading chunks of the (very) large image set to each machine (done once), I still need to maximize the speed of image reading on each of them (done many times). – rodrigob May 17 '13 at 15:20
  • @rodrigob, OK that makes sense. Well, good excuse to get a bigger cluster. :-) One thing that comes to mind is that JPEG-2000 might get a better compression ratio than PNG for photographic images and is still lossless. Formatting the filesystem to have a block size slightly larger than the images can help too, but might not be worth the trouble. – smocking May 17 '13 at 15:31
  • @rodrigob: You might mention that you're running it either on server or on linux. AV impacts performance on windows platform. – SigTerm May 17 '13 at 16:20

5 Answers

5

Dear Stack Overflow community,

As promised, here are the results of the experiments done based on your many suggestions. A special thanks to @user894763 who put me on the "right path".

tl;dr: use pnm files inside an uncompressed tar (yes, I said pnm!).

I have done experiments on two high-end machines, one equipped with SSD disks and the other using a networked file system. Both have high-end CPUs, but they sit at the two sides of the spectrum regarding disk access. Surprisingly, the conclusions are the same for both machines, so I report only one set of results (for the latter case). The ratios amongst file formats are almost identical in both experiments.

From these experiments I have learned two important things:

  • When reading files from disk repeatedly, the operating system disk cache is king (i.e. the operating system tries as much as possible to keep file operations in RAM instead of going to the physical device, and it does a really good job at this).
  • Contrary to my initial guess, reading images from disk is a CPU-bound operation, not an I/O-bound one.

Experiment protocol

I am reading a set of ~1200 images in a fixed sequence; no computation is done on the images, I am simply measuring the time to load the pixels into memory. The tar file sizes are ~600 MB in pnm format, ~300 MB in png format, and ~200 MB in webp format.

"Fresh read" means first read done on the machine.
"Cached read" means the second read done on the same machine (and any subsequent one).

All numbers are roughly +- 10 Hz.

webp fresh read: 30 Hz
webp cached read: 80 Hz

webp + tar fresh read: 100 Hz
webp + tar cached read: 100 Hz

png fresh read:  50 Hz
png cached read: 165 Hz

png + tar fresh read: 200 Hz
png + tar cached read: 200 Hz

pnm fresh read: 50 Hz
pnm cached read: 600 Hz

pnm + tar fresh read: 200 Hz
pnm + tar cached read: 2300 Hz

Notes

I was told that maybe there is a way to change the webp compression parameters to make decompression faster. I suspect it would still not match the pnm performance.

Please note that I used custom code to read the images from the tar file; the file is read from disk "image by image".
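
For illustration, here is a minimal sketch of such "image by image" tar reading (it assumes the classic 512-byte ustar header layout; the archive name and the decoding step are placeholders, and error handling is kept to a minimum):

#include <cstdlib>
#include <cstring>
#include <fstream>
#include <string>
#include <vector>

struct TarEntry {
    std::string name;
    std::vector<char> data;
};

// Reads the next entry of an uncompressed tar archive (classic ustar layout:
// a 512-byte header with the name at offset 0 and the octal size at offset 124,
// followed by the data padded to a 512-byte boundary).
bool read_next_entry(std::ifstream &tar, TarEntry &entry) {
    char header[512];
    if (!tar.read(header, sizeof(header))) return false;
    if (header[0] == '\0') return false;  // zero block: end of archive

    const char *name_end = static_cast<const char *>(std::memchr(header, '\0', 100));
    entry.name.assign(header, name_end ? name_end : header + 100);

    const std::size_t size = std::strtoul(header + 124, nullptr, 8);
    entry.data.resize(size);
    if (size > 0 && !tar.read(entry.data.data(), size)) return false;

    // skip the padding up to the next 512-byte boundary
    tar.seekg((512 - size % 512) % 512, std::ios::cur);
    return true;
}

int main() {
    std::ifstream tar("images.tar", std::ios::binary);  // placeholder archive name
    TarEntry entry;
    while (read_next_entry(tar, entry)) {
        // decode_image(entry.name, entry.data);  // hand the raw bytes to the pnm/png/webp decoder
    }
}

The same loop works for the pnm, png, and webp archives; only the decoding step changes.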

I do not know why reading the webp images "fresh" was slower than the png ones; I can only speculate that the networked disk system had some "internal" cache that somewhat changed the behaviour. However, this does not affect the lessons.

Lessons

  1. If you will read a file (or a set of files) multiple times, the operating system disk cache will make all future reads essentially "as fast as reading from RAM".

  2. Even when reading from disk, the time to decompress the images is non-negligible.

  3. Putting all the files into a single uncompressed (tar) file makes things significantly faster because the operating system will assume that the whole file will be read, pre-loading future images even before we access them. This seems not to happen when simply reading images from a folder (a readahead-hint sketch follows this list).

  4. With proper care, a 4x to 10x speed-up can be obtained when reading a sequence of images from disk (especially if read repeatedly).
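
The readahead-hint sketch mentioned in lesson 3: on POSIX systems one can also ask the kernel explicitly for aggressive sequential readahead on the single big file, e.g. via posix_fadvise. This is an idea to evaluate rather than something measured above; the file name is a placeholder and error handling is omitted:

#include <fcntl.h>
#include <unistd.h>

int main() {
    int fd = open("images.tar", O_RDONLY);  // placeholder archive name

    // Tell the kernel the whole file will be read front to back, so it can
    // ramp up readahead...
    posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
    // ...and optionally ask it to start fetching the first chunk right away.
    posix_fadvise(fd, 0, 64 << 20, POSIX_FADV_WILLNEED);

    // ... read the archive sequentially as usual ...

    close(fd);
}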

rodrigob
  • Following https://groups.google.com/a/webmproject.org/forum/?fromgroups=#!topic/webp-discuss/FPOfZs2cCS4 I tried "cwebp -preset photo -q 100" instead of "cwebp -preset photo -lossless". The new tar file is now ~100 MB; reading webp files fresh is 90 Hz, 105 Hz cached. From the tar file it is 115 Hz (both fresh and cached). Better, but still not better than pnm + tar (plus we may have introduced minor compression artefacts). – rodrigob Jun 17 '13 at 17:39
3

PNG is not built for speed. It's slower than jpeg and no smaller than tif. If you're stuck with PNG, no other optimisations will make any difference.

For example:

$ time vips avg wtc.tif
117.853995
real    0m0.525s
user    0m0.756s
sys 0m0.580s
$ time vips avg wtc.png
117.853995
real    0m3.622s
user    0m3.984s
sys 0m0.584s

where "wtc" is a 10,000 x 10,000 pixel RGB photo, the tif is uncompressed strip format and the png is also uncompressed, both images were in disc cache, and "avg" finds and prints the average pixel value.

vips has its own ".v" format, which is simply a huge buffer of pixels. This format can be read in parallel with mmap() and is a bit quicker again:

$ time vips avg wtc.v
117.853995
real    0m0.162s
user    0m0.460s
sys 0m0.092s

If your images can be compressed, the tradeoffs shift a bit. For example, jpeg will typically compress 10x, so decode speed becomes much more important than disc speed. You'd want to use an optimised decode library like libturbojpeg and process several files at once (a sketch of this follows the timing below).

$ time vips avg wtc.jpg
117.853995 
real    0m1.413s
user    0m1.696s
sys 0m0.564s
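
If you do go the jpeg route, here is a minimal sketch of decoding several files at once with the TurboJPEG API from libjpeg-turbo (it assumes the library is installed and linked with -lturbojpeg; the file names are placeholders and error handling is omitted):

#include <turbojpeg.h>

#include <fstream>
#include <future>
#include <iterator>
#include <string>
#include <vector>

// Decode one jpeg file to an RGB buffer; each task uses its own decoder handle,
// so the tasks are fully independent.
static std::vector<unsigned char> decode_jpeg(const std::string &path) {
    std::ifstream in(path, std::ios::binary);
    std::vector<unsigned char> jpeg((std::istreambuf_iterator<char>(in)),
                                    std::istreambuf_iterator<char>());

    tjhandle handle = tjInitDecompress();
    int width = 0, height = 0, subsamp = 0, colorspace = 0;
    tjDecompressHeader3(handle, jpeg.data(), jpeg.size(), &width, &height,
                        &subsamp, &colorspace);

    std::vector<unsigned char> rgb(static_cast<std::size_t>(width) * height * 3);
    tjDecompress2(handle, jpeg.data(), jpeg.size(), rgb.data(),
                  width, 0 /*pitch*/, height, TJPF_RGB, TJFLAG_FASTDCT);
    tjDestroy(handle);
    return rgb;
}

int main() {
    const std::vector<std::string> files = {"img0.jpg", "img1.jpg", "img2.jpg"};

    std::vector<std::future<std::vector<unsigned char>>> jobs;
    for (const auto &f : files)
        jobs.push_back(std::async(std::launch::async, decode_jpeg, f));

    for (auto &job : jobs) {
        std::vector<unsigned char> pixels = job.get();
        // ... process the decoded RGB buffer ...
    }
}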

PNG uses libz and for photographic images won't get more than about 2x compression. Even at the same compression levels it's quite a lot slower than tif with deflate:

$ time vips avg wtc.tif
117.853995
real    0m3.154s
user    0m3.496s
sys 0m0.540s
$ time vips avg wtc.png
117.853995
real    0m4.888s
user    0m5.196s
sys 0m0.556s
$ ls -l wtc.*
-rw-r--r-- 1 john john  15150881 Feb 20  2012 wtc.jpg
-rw-rw-r-- 1 john john 135803013 May 18 12:47 wtc.png
-rw-rw-r-- 1 john john 143807446 May 18 12:53 wtc.tif
-rw-rw-r-- 1 john john 263509369 May 18 12:37 wtc.v

I suppose the other factor is how time-consuming your processing is. If you're doing something intensive, read speed and decode speed will not be important.

jcupitt
  • My processing is currently highly optimized (GPU enabled), it is so fast that disk I/O became a bottleneck when doing operations from disk. Indeed I have identified that PNG decompression is an issue. I need to use a lossless format, but then multiple choices are possible. I am surprised to see that reading the "v" format is faster than jpg, would this also apply to simple pgm images? I am considering benchmarking WebP which should be both faster to decompress and more compact. – rodrigob May 18 '13 at 20:37
  • The initial results with WebP are not encouraging at all, let us see how this evolves. https://groups.google.com/a/webmproject.org/forum/?fromgroups=#!topic/webp-discuss/FPOfZs2cCS4 – rodrigob Jun 12 '13 at 18:59
1

You should reverse the order of reading. That is, in the first pass read from image 1 to image N, then in the second pass read from image N to image 1, then in the third pass read from image 1 to image N, and so on. That way you'll hit the disk cache more often.

Processing (or at least loading) several images at once, in different threads, might benefit the overall throughput too, because the OS will then be able to optimize the disk seeks.

If the OS has good support for AIO then it might be beneficial as well.

Putting the images into a single file might indeed help to minimize the seeks (it depends on the filesystem's defragmentation policies, though). In that case you should use an archive with fast access to a single file, so that you are able to read the files in reverse order, e.g. "zip" with no compression.

With memory mapping there should be an option to ask the OS to prefetch a portion of the memory-mapped file (e.g. MAP_POPULATE). Reading large portions of the archive that way might be faster than reading it block by block.
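
For illustration, a minimal sketch of that memory-mapping idea (the MAP_POPULATE flag and the madvise hint are Linux-specific; the archive name is a placeholder and error checks are omitted):

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main() {
    int fd = open("images.tar", O_RDONLY);  // placeholder archive name
    struct stat st;
    fstat(fd, &st);

    // MAP_POPULATE pre-faults the mapping, i.e. the kernel starts reading the
    // file in ahead of the first access instead of faulting page by page.
    void *base = mmap(nullptr, st.st_size, PROT_READ,
                      MAP_PRIVATE | MAP_POPULATE, fd, 0);

    // Also declare the access pattern as sequential so readahead keeps going
    // while we walk through the archive.
    madvise(base, st.st_size, MADV_SEQUENTIAL);

    // ... parse the archive entries directly out of the mapped memory ...

    munmap(base, st.st_size);
    close(fd);
}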

ArtemGr
  • Reversing the order sounds like an interesting idea. However, in my case I will be storing some results to disk in between passes. This will probably kill the benefits of the suggested trick (unless the stored results are much smaller than the disk cache, I guess). – rodrigob May 17 '13 at 14:08
0

Memory mapping, especially since you plan to re-read the images several times, will be the fastest method to get the data into RAM with as few copies as possible.
Using "clever tricks" like unbuffered reads to take advantage of DMA is not advisable, since they bypass the buffers, which are orders of magnitude faster than the disk. That may be an advantage when touching data once and only once, but never if you want to read a piece more than once, as in your case. Normal buffered reads are usually measurably slower than memory mapping too, since they need to do a memory copy.

On a typical hard disk, you can expect a performance of around 100 MB/s on the first run, and 3-4 GB/s from the buffers on the second and further runs (possibly more on a fast machine).

Decoding PNG involves decompressing an LZ77 stream, so this may become a limiting factor, too. To combat this, you can multithread. It is not entirely trivial to multithread the decoding of a single stream, but nothing hinders you from decoding several images at the same time (which is pretty trivial).
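
For illustration, here is a sketch of decoding several images at the same time; it uses the stb_image single-header decoder and an OpenMP parallel loop purely as stand-ins (both are assumptions, any decoder and threading scheme would do), and the file names are placeholders:

// Build e.g. with: g++ -O2 -fopenmp decode_many.cpp (stb_image.h must be on the include path)
#define STB_IMAGE_IMPLEMENTATION
#include "stb_image.h"

#include <cstdio>
#include <string>
#include <vector>

int main() {
    const std::vector<std::string> files = {"img0.png", "img1.png", "img2.png"};

    // Each iteration decodes one image; the iterations are independent, so they
    // can run on separate cores even though each single decode stays serial.
    #pragma omp parallel for
    for (int i = 0; i < static_cast<int>(files.size()); ++i) {
        int width = 0, height = 0, channels = 0;
        unsigned char *pixels = stbi_load(files[i].c_str(), &width, &height, &channels, 0);
        if (pixels) {
            // ... compute values from the width x height x channels buffer ...
            stbi_image_free(pixels);
        } else {
            std::fprintf(stderr, "failed to decode %s\n", files[i].c_str());
        }
    }
}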

Concatenating the images into one huge file may give an advantage as it may reduce seeks, but this usually only starts to really matter if you have to read hundreds or thousands of files. In that case you'll preferably store them in the order in which you'll read them, too (hoping that this results in a contiguous layout on the disk; no guarantee, however).

Damon
  • How would memory mapping help speed? The amount of data to read is much larger than the available RAM, and the data is read in full passes. By the time I get to the last image, the memory buffer will have "lost" the data from the first image, so the next time I start reading the images the OS will have to read from disk again. Or am I missing something? – rodrigob May 17 '13 at 13:49
  • Even so, memory mapping is a benefit, though slightly less of one (the buffers obviously won't help too much on much-larger-than-physical-RAM sets). With mapping, you only need one copy in physical RAM (the one that the OS owns gets a page table entry for your application), so you effectively consume half as much physical memory. Plus, you save doing the actual copy, and the prefetching works much more "naturally" and more effectively. – Damon May 17 '13 at 14:10
  • I can see how some RAM is saved; however, regarding prefetching, this http://lemire.me/blog/archives/2012/06/26/which-is-fastest-read-fread-ifstream-or-mmap/ seems to suggest that is not the case when doing large sequential reads. – rodrigob May 17 '13 at 14:27
  • [This old question of mine](http://stackoverflow.com/q/5909345/572743) is about a different issue, but includes benchmarks for memory mapping vs. sequential reads. Memory mapping is faster than any other method. The timings on the site you linked to are certainly not correct, they're not even plausible (streams faster than raw libc reads, faster than memory mapping). If you think about it, this is not even theoretically possible on most OSs because they implement `read` via filemapping internally. libc is layered on that (with an extra copy) and streams are layered on that. – Damon May 17 '13 at 20:24
  • It is very well possible that `fread` is faster than `read` (or `ReadFile`) when reading tiny amounts (reading 1-2 bytes a few million times) due to buffering reducing syscall overhead, but no such thing is true for filemapping. It is not possible to access data that is readily present faster by first copying it to a library buffer and then to another location in the application. – Damon May 17 '13 at 20:30
  • DMA is being used by the driver (if available and enabled). You have no control over it in user mode. – user1764961 May 18 '13 at 13:01
  • @user1764961: Not true. Page faults naturally use DMA if available (that's the default case on not-15-year-old hardware with a not-15-year-old OS), but for standard I/O you _certainly_ have control whether I/O to your address space happens via DMA. Depending on the API and mode you use you get or do not get DMA. Under Windows, that would be overlapped I/O and under Linux it would be using kaio, in both cases with buffering disabled (which, as I pointed out, is usually an anti-optimization). – Damon May 20 '13 at 16:50
0

You should ask yourself:

  • How much time does it take to compute whatever you are computing on one unit (either a full image or a segment of it)?
  • During this time, how many image units can you read (let's say N)?

I don't know how to make reading a single image unit faster, but there is something else you can try.

Create a shared/global buffer to hold image units, and use a thread to keep storing image units in it. If N is less than 1, it means you can read faster than you can consume the images, so it does NOT help much to go for a faster read. However, if you consume images much faster (e.g. N threads are working together to consume images), then you need more reader threads to keep enough image units in memory.

Building a producer-consumer model with threads is straightforward in theory, but the implementation is often tricky.
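
For illustration, a minimal sketch of such a producer-consumer setup with a bounded queue (the Image type and the load_image() stub are placeholders for the real reading/decoding code):

#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

struct Image { std::vector<unsigned char> pixels; };

// Placeholder: read and decode image #index from disk.
Image load_image(int /*index*/) { return Image{std::vector<unsigned char>(1024, 0)}; }

// A queue with a fixed capacity: the producer blocks when it is full,
// the consumer blocks when it is empty.
class BoundedQueue {
public:
    explicit BoundedQueue(std::size_t capacity) : capacity_(capacity) {}

    void push(Image img) {
        std::unique_lock<std::mutex> lock(mutex_);
        not_full_.wait(lock, [&] { return queue_.size() < capacity_; });
        queue_.push(std::move(img));
        not_empty_.notify_one();
    }

    Image pop() {
        std::unique_lock<std::mutex> lock(mutex_);
        not_empty_.wait(lock, [&] { return !queue_.empty(); });
        Image img = std::move(queue_.front());
        queue_.pop();
        not_full_.notify_one();
        return img;
    }

private:
    std::size_t capacity_;
    std::queue<Image> queue_;
    std::mutex mutex_;
    std::condition_variable not_full_, not_empty_;
};

int main() {
    const int num_images = 1000;  // placeholder count
    BoundedQueue queue(8);        // keep at most 8 decoded images buffered in RAM

    std::thread producer([&] {    // reader/decoder thread
        for (int i = 0; i < num_images; ++i) queue.push(load_image(i));
    });

    for (int i = 0; i < num_images; ++i) {
        Image img = queue.pop();
        // ... compute values from img ...
    }
    producer.join();
}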

PS: Running more than one thread on a single processor is usually less efficient than a normal thread-less program. Unless you have a many-core machine, I don't see a way for improvement.

Dilawar
  • The bottleneck is not CPU processing speed, it's the disk->cpu transfer. Even if you implement a multi-threaded reader, using 100% of the CPU time to do that, all the threads will need to access the north bridge bus before accessing the disk memory. The producer will be much slower in the case you are describing, which will slow down the system overall. – Ian Medeiros May 17 '13 at 18:12