@Derek provided the following additional information:
> (Run time) ... "is over a minute, compared to 10 - 14 seconds before. I am not doing any specific threading, though I do have some OpenMP pragmas. Moving the I/O outside of the filter loop did not change any of those though. I am running CentOS 5.5. The image size is approx 72MB"
That is a huge difference in run time. Since OpenMP is used, we can assume there are multiple threads. Since you're only dealing with 72 MB of data, I can't see how the difference in I/O time alone could be that large. We can be confident the read time is smaller than your original 10-14 seconds, so unless you have a bug in that portion of the code, the extra time is in the filter section. The images are presumably binary? As @Satya has suggested, profiling your code, or at least adding some timing printouts, may help identify where the problem lies.
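For example, something like this will tell you how the minute splits between I/O and filtering (a minimal sketch; `read_images` and `filter_images` are hypothetical stand-ins for your own stages):

    #include <stdio.h>
    #include <time.h>

    /* Wall-clock timer; CLOCK_MONOTONIC is available on CentOS 5.5 (link with -lrt). */
    static double now_seconds(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec / 1e9;
    }

    /* Hypothetical stand-ins -- replace with your real read and filter code. */
    static void read_images(void)   { }
    static void filter_images(void) { }

    int main(void)
    {
        double t0 = now_seconds();
        read_images();
        double t1 = now_seconds();
        filter_images();
        double t2 = now_seconds();
        printf("read: %.3f s, filter: %.3f s\n", t1 - t0, t2 - t1);
        return 0;
    }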
The "advantage" of reading in the loop may be:
1. The OS is giving you some parallelism, because it can perform some of the I/O (e.g. read-ahead) in parallel with your computation. You lose that parallelism when you read everything in advance, effectively blocking while reading.
2. The read data is still in the cache by the time your filter accesses it. Cache misses can really kill performance if the processing is lightweight relative to the memory bandwidth; see the demo after this list. It's hard to believe this would make a significant difference in this use case, though, because disk I/O is so much slower than memory.
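To see how much cache behaviour alone can matter, here is a small self-contained demo (my own illustration, not from your code) that touches the same amount of memory twice, sequentially and then with a page-sized stride; on most machines the strided pass is several times slower purely because of cache misses:

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N (64 * 1024 * 1024)   /* 64 MB, on the order of the 72 MB image */
    #define STRIDE 4096            /* jump one page at a time */

    int main(void)
    {
        unsigned char *buf = calloc(N, 1);
        long sum = 0;
        size_t i, j;
        clock_t t;

        /* Sequential pass: the hardware prefetcher keeps the data in cache. */
        t = clock();
        for (i = 0; i < (size_t)N; i++) sum += buf[i];
        printf("sequential: %.2f s\n", (double)(clock() - t) / CLOCKS_PER_SEC);

        /* Strided pass: same N bytes touched, but nearly every access misses. */
        t = clock();
        for (j = 0; j < STRIDE; j++)
            for (i = j; i < (size_t)N; i += STRIDE) sum += buf[i];
        printf("strided:    %.2f s\n", (double)(clock() - t) / CLOCKS_PER_SEC);

        free(buf);
        return (int)(sum & 1);   /* keep the loops from being optimized away */
    }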
Given your latest update it does seem more likely we're dealing with #2. Something to watch out for, though, is the memory access pattern (across all threads): it is possible you are seeing cache thrashing because data that used to be adjacent in main memory is now further apart. This can have a large impact, because if you have many memory accesses and they are all cache misses, you always pay the cost of fetching the data from further out in the memory hierarchy, which can be an order of magnitude slower.
A solution to this is to arrange your memory in stripes, e.g. n lines from the first image, followed by n lines from the second image, followed by n lines from the third image. IIRC this technique is called "striping". The exact stripe size depends on your CPU's cache, but it's something you can experiment with (or start with the same amount of data that used to be read in the inner loop, if that's large enough).
E.g.:
    size_t count;
    int stripe_number = 0;

    /* Stripe s of image i lands at offset (s * NUM_IMAGES + i) * STRIPE_SIZE,
       interleaving the images in memory. image_index is the file being read. */
    do
    {
        count = fread(striped_buffer + ((size_t)stripe_number * NUM_IMAGES + image_index) * STRIPE_SIZE,
                      1, STRIPE_SIZE, image_file);
        stripe_number++;
    } while (count != 0);
Read one file at a time, stepping image_index for each, so you're not seeking back and forth on your drive.
Regardless, to maximize performance you probably want to look into asynchronous/overlapped I/O, so that the next chunk of image data is being read in while you are processing the previous one.
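On Linux, one way to sketch this is with POSIX asynchronous I/O (`<aio.h>`, link with `-lrt`). This is only an outline under my own assumptions: `image.dat`, the chunk size, and `process()` are placeholders, not your actual names:

    #include <aio.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    #define CHUNK (1 << 20)                /* 1 MB per read; tune as needed */

    static char buf[2][CHUNK];             /* double buffer */
    static void process(char *p, ssize_t n) { (void)p; (void)n; /* filter here */ }

    int main(void)
    {
        int fd = open("image.dat", O_RDONLY);  /* placeholder file name */
        struct aiocb cb;
        const struct aiocb *list[1] = { &cb };
        off_t offset = 0;
        int cur = 0;

        memset(&cb, 0, sizeof cb);
        cb.aio_fildes = fd;
        cb.aio_buf    = buf[cur];
        cb.aio_nbytes = CHUNK;
        cb.aio_offset = offset;
        aio_read(&cb);                     /* kick off the first read */

        for (;;)
        {
            ssize_t n;
            aio_suspend(list, 1, NULL);    /* wait for the outstanding read */
            if (aio_error(&cb) != 0)
                break;
            n = aio_return(&cb);
            if (n <= 0)
                break;
            offset += n;

            cur = !cur;                    /* start the next read ... */
            cb.aio_buf    = buf[cur];
            cb.aio_offset = offset;
            aio_read(&cb);

            process(buf[!cur], n);         /* ... while processing this chunk */
        }

        close(fd);
        return 0;
    }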
If you're developing under Windows, this can give you a start on doing overlapped I/O:
http://msdn.microsoft.com/en-us/library/ms686358%28v=vs.85%29.aspx
Once you are doing your I/O in parallel, you can figure out whether your bottleneck is the I/O or the processing. There are different techniques for optimizing each.