@Derek provided the following additional information:
> (Run time) ... "is over a minute, compared to 10 - 14 seconds before. I am not doing any specific threading, though I do have some OpenMP pragmas. Moving the I/O outside of the filter loop did not change any of those though. I am running CentOS 5.5. The image size is approx 72MB"
That is a huge difference in run time. Since OpenMP is used, we can assume there are multiple threads. Since you're only dealing with 72 MB of data, I can't see how the difference in I/O time alone could be that large. We can be confident the read time is smaller than your original 10-14 seconds, so unless you have a bug in that portion of the code, the extra time is in the filter section. The images are presumably binary? As @Satya has suggested, profiling your code, or at least adding some timing printouts, may help identify where the problem lies.
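For example, something like this will tell you how the minute splits between I/O and filtering (a minimal sketch; `read_images` and `filter_images` are hypothetical stand-ins for your own stages):

    #include <stdio.h>
    #include <time.h>

    /* Wall-clock timer; CLOCK_MONOTONIC is available on CentOS 5.5 (link with -lrt). */
    static double now_seconds(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec / 1e9;
    }

    /* Hypothetical stand-ins -- replace with your real read and filter code. */
    static void read_images(void)   { }
    static void filter_images(void) { }

    int main(void)
    {
        double t0 = now_seconds();
        read_images();
        double t1 = now_seconds();
        filter_images();
        double t2 = now_seconds();
        printf("read: %.3f s, filter: %.3f s\n", t1 - t0, t2 - t1);
        return 0;
    }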
The "advantage" of reading in the loop may be:
1. The OS is giving you some parallelism, because it can perform some of the I/O (e.g. read-ahead) in parallel with your computation. You lose that parallelism when you read everything in advance, effectively blocking while reading.
2. The read data is still in the cache by the time your filter accesses it. Cache misses can really kill performance if the processing is lightweight relative to the memory bandwidth; see the demo after this list. It's hard to believe this would make a significant difference in this use case, though, because disk I/O is so much slower than memory.
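To see how much cache behaviour alone can matter, here is a small self-contained demo (my own illustration, not from your code) that touches the same amount of memory twice, sequentially and then with a page-sized stride; on most machines the strided pass is several times slower purely because of cache misses:

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N (64 * 1024 * 1024)   /* 64 MB, on the order of the 72 MB image */
    #define STRIDE 4096            /* jump one page at a time */

    int main(void)
    {
        unsigned char *buf = calloc(N, 1);
        long sum = 0;
        size_t i, j;
        clock_t t;

        /* Sequential pass: the hardware prefetcher keeps the data in cache. */
        t = clock();
        for (i = 0; i < (size_t)N; i++) sum += buf[i];
        printf("sequential: %.2f s\n", (double)(clock() - t) / CLOCKS_PER_SEC);

        /* Strided pass: same N bytes touched, but nearly every access misses. */
        t = clock();
        for (j = 0; j < STRIDE; j++)
            for (i = j; i < (size_t)N; i += STRIDE) sum += buf[i];
        printf("strided:    %.2f s\n", (double)(clock() - t) / CLOCKS_PER_SEC);

        free(buf);
        return (int)(sum & 1);   /* keep the loops from being optimized away */
    }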
Given your latest update it does seem more likely we're dealing with #2. Something to watch out for, though, is the memory access pattern (across all threads): it is possible you are seeing cache thrashing because data that used to be adjacent in main memory is now further apart. This can have a large impact, because if you have many memory accesses and they are all cache misses, you always pay the cost of fetching the data from further out in the memory hierarchy, which can be an order of magnitude slower.
A solution to this is to arrange your memory in stripes, e.g. n lines from the first image, followed by n lines from the second image, followed by n lines from the third image. IIRC this technique is called "striping". The exact stripe size depends on your CPU's cache, but it's something you can experiment with (or start with the same amount of data that used to be read in the inner loop, if that's large enough).
E.g.:
    size_t count;
    int stripe_number = 0;

    /* Stripe s of image i lands at offset (s * NUM_IMAGES + i) * STRIPE_SIZE,
       interleaving the images in memory. image_index is the file being read. */
    do
    {
        count = fread(striped_buffer + ((size_t)stripe_number * NUM_IMAGES + image_index) * STRIPE_SIZE,
                      1, STRIPE_SIZE, image_file);
        stripe_number++;
    } while (count != 0);
Read one file at a time, stepping image_index for each, so you're not seeking back and forth on your drive.
Regardless, to maximize performance you probably want to look into asynchronous/overlapped I/O, so that the next chunk of image data is being read in while you are processing the previous one.
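On Linux, one way to sketch this is with POSIX asynchronous I/O (`<aio.h>`, link with `-lrt`). This is only an outline under my own assumptions: `image.dat`, the chunk size, and `process()` are placeholders, not your actual names:

    #include <aio.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    #define CHUNK (1 << 20)                /* 1 MB per read; tune as needed */

    static char buf[2][CHUNK];             /* double buffer */
    static void process(char *p, ssize_t n) { (void)p; (void)n; /* filter here */ }

    int main(void)
    {
        int fd = open("image.dat", O_RDONLY);  /* placeholder file name */
        struct aiocb cb;
        const struct aiocb *list[1] = { &cb };
        off_t offset = 0;
        int cur = 0;

        memset(&cb, 0, sizeof cb);
        cb.aio_fildes = fd;
        cb.aio_buf    = buf[cur];
        cb.aio_nbytes = CHUNK;
        cb.aio_offset = offset;
        aio_read(&cb);                     /* kick off the first read */

        for (;;)
        {
            ssize_t n;
            aio_suspend(list, 1, NULL);    /* wait for the outstanding read */
            if (aio_error(&cb) != 0)
                break;
            n = aio_return(&cb);
            if (n <= 0)
                break;
            offset += n;

            cur = !cur;                    /* start the next read ... */
            cb.aio_buf    = buf[cur];
            cb.aio_offset = offset;
            aio_read(&cb);

            process(buf[!cur], n);         /* ... while processing this chunk */
        }

        close(fd);
        return 0;
    }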
If you're developing under Windows, this can give you a start on doing overlapped I/O:
http://msdn.microsoft.com/en-us/library/ms686358%28v=vs.85%29.aspx
Once you are doing your I/O in parallel, you can figure out whether your bottleneck is the I/O or the processing. There are different techniques for optimizing each.