
I'm trying to parallelize my application using OpenMP (and C) and wanted to start with the I/O part. Initially the reading and the computational part are sequential and take around 3 seconds each.

int *mask, width, height;
Picture *pic;

pic = readFile("some big file");   // 3 secs
mask = computeMask(width, height); // 3 secs

With OpenMP:

#pragma omp parallel default(none) shared(pic, mask, width, height)
{
 #pragma omp sections
 {
  #pragma omp section
  {
   pic = readFile("some big file");
  }
  #pragma omp section
  {
   mask = computeMask(width, height);
  }
 }
}

But now the overall time is around 10 seconds (and it is mostly spent in the I/O task).

Before I start blaming concurrent access to the RAM for creating that bottleneck, I'd love to know if there is something I got wrong here.
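
For reference, here is a minimal, self-contained sketch of how the per-section times can be measured with `omp_get_wtime()`; the `readFile`/`computeMask` bodies below are only trivial stand-ins, not my real implementations:

#include <omp.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct { size_t size; unsigned char *data; } Picture;

static Picture *readFile(const char *path)      /* stand-in for the real reader */
{
    (void)path;
    Picture *p = malloc(sizeof *p);
    p->size = 64u * 1024 * 1024;
    p->data = malloc(p->size);
    memset(p->data, 0x42, p->size);             /* pretend we read the file */
    return p;
}

static int *computeMask(int width, int height)  /* stand-in for the real mask computation */
{
    int *m = malloc((size_t)width * height * sizeof *m);
    for (long i = 0; i < (long)width * height; i++)
        m[i] = (int)(i % 255);
    return m;
}

int main(void)
{
    int width = 4096, height = 4096;
    Picture *pic = NULL;
    int *mask = NULL;

    #pragma omp parallel sections default(none) shared(pic, mask, width, height)
    {
        #pragma omp section
        {
            double t = omp_get_wtime();
            pic = readFile("some big file");
            printf("readFile:    %.2f s\n", omp_get_wtime() - t);
        }
        #pragma omp section
        {
            double t = omp_get_wtime();
            mask = computeMask(width, height);
            printf("computeMask: %.2f s\n", omp_get_wtime() - t);
        }
    }

    free(pic->data);
    free(pic);
    free(mask);
    return 0;
}

(Compiled with `gcc -fopenmp`.)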

greut
  • OpenMP sections execute concurrently, e.g. `computeMask` would execute _while_ `readFile` is reading the file. If `computeMask` uses data that `readFile` pumps, then it would either produce incorrect results (it might access data that still hasn't been read) or (false) sharing would lead to a vastly increased cache miss rate. I/O is usually as fast as it can be (a bandwidth-limited operation) and the only way to make it faster is to run on a (distributed) system with many I/O controllers. – Hristo Iliev Dec 04 '12 at 21:18
  • @HristoIliev the two sections can be executed concurrently as they are not sharing any data with each other. `readFile` pumps some data from the disk into memory while `computeMask` fills another memory area. The final operation uses both of them to compute the destination picture. – greut Dec 05 '12 at 09:00
  • 1
  • I see. Then you might also be limited by the memory bandwidth. Is `readFile` doing some processing (e.g. decompression) on the content of the file, or is it just reading it into memory as-is? If the latter, then you might try memory mapping instead to have the data read on demand from the disk. – Hristo Iliev Dec 05 '12 at 09:09
  • The memory bandwidth is the clear limitation here. I'll dive into `mmap`, as it's (in most cases) a direct correlation (see the sketch after these comments). Many thanks! – greut Dec 05 '12 at 14:50
  • Any `malloc`/`mmap` will go into a critical section, so it won't be efficiently parallelized. http://openmp.org/forum/viewtopic.php?f=3&t=714 – greut Dec 05 '12 at 15:06
  • 1
  • `malloc` used to be locking years ago. This is [no longer the case](http://stackoverflow.com/a/13338973/1374437) - `glibc` uses an almost lockless `malloc` implementation. And you do not need to synchronise `mmap` invocations (what you have linked to talks about `mmap` in the context of `malloc`, since `malloc` uses it for large allocations). – Hristo Iliev Dec 05 '12 at 17:07
  • 1
  • You're my new bible! :-) With `mmap`, the `readFile` operation is now almost instantaneous. Marvellous! Many thanks. – greut Dec 05 '12 at 20:49
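
Following the comments above, here is a minimal sketch of the kind of memory-mapped reading that was suggested; the `Picture` layout and the `readFileMapped` name are assumptions for illustration, not my actual code:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

typedef struct { size_t size; unsigned char *data; } Picture;

Picture *readFileMapped(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return NULL; }

    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); close(fd); return NULL; }

    /* Establish the mapping; the actual disk reads happen lazily, page by
     * page, when the data is first touched. */
    void *map = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                      /* the mapping stays valid after close() */
    if (map == MAP_FAILED) { perror("mmap"); return NULL; }

    Picture *pic = malloc(sizeof *pic);
    if (!pic) { munmap(map, (size_t)st.st_size); return NULL; }
    pic->data = map;
    pic->size = (size_t)st.st_size;
    return pic;
}

Setting up the mapping returns almost immediately; the pages are only pulled from disk when the data is later touched, which is presumably why the `readFile` step itself now appears instantaneous.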
