
I want to read an input file (in C/C++) and process each line independently as fast as possible. The processing takes a few ticks itself, so I decided to use OpenMP threads. I have this code:

#pragma omp parallel num_threads(num_threads)
{
  string line;
  while (true) {
#pragma omp critical(input)
    {
      // only one thread at a time reads the next line from the shared stream
      getline(f, line);
    }
    if (f.eof())
      break;
    process_line(line);
  }
}

My question is, how do I determine the optimal number of threads to use? Ideally, I would like this to be detected dynamically at runtime. I don't understand the `schedule(dynamic)` option, so I can't say whether it would help. Any insights?

Also, I'm not sure how to determine the optimal number "by hand". I tried various numbers for my specific application. I would have thought the CPU usage reported by top would help, but it doesn't(!) In my case, the CPU usage stays consistently at around num_threads*(85-95). However, using pv to observe the speed at which I'm reading input, I noted that the optimal number is around 2-5; above that, the input speed drops. So my question is: why would I then see a CPU usage of 850 when using 10 threads?? Can this be due to some inefficiency in how OpenMP handles threads waiting to get into the critical section?

EDIT: Here are some timings. I obtained them with:

for NCPU in $(seq 1 20) ; do echo "NCPU=$NCPU" ; { pv -f -a my_input.gz | pigz -d -p 20 | { { sleep 60 ; PID=$(ps gx -o pid,comm | grep my_prog | sed "s/^ *//" | cut -d " " -f 1) ; USAGE=$(ps h -o "%cpu" $PID) ; kill -9 $PID ; sleep 1 ; echo "usage: $USAGE" >&2 ; } & cat ; } | ./my_prog -N $NCPU >/dev/null 2>/dev/null ; sleep 2 ; } 2>&1 | grep -v Killed ; done

NCPU=1  [8.27MB/s]  usage: 98.4
NCPU=2  [12.5MB/s]  usage: 196
NCPU=3  [18.4MB/s]  usage: 294
NCPU=4  [23.6MB/s]  usage: 393
NCPU=5  [28.9MB/s]  usage: 491
NCPU=6  [33.7MB/s]  usage: 589
NCPU=7  [37.4MB/s]  usage: 688
NCPU=8  [40.3MB/s]  usage: 785
NCPU=9  [41.9MB/s]  usage: 884
NCPU=10 [41.3MB/s]  usage: 979
NCPU=11 [41.5MB/s]  usage: 1077
NCPU=12 [42.5MB/s]  usage: 1176
NCPU=13 [41.6MB/s]  usage: 1272
NCPU=14 [42.6MB/s]  usage: 1370
NCPU=15 [41.8MB/s]  usage: 1493
NCPU=16 [40.7MB/s]  usage: 1593
NCPU=17 [40.8MB/s]  usage: 1662
NCPU=18 [39.3MB/s]  usage: 1763
NCPU=19 [38.9MB/s]  usage: 1857
NCPU=20 [37.7MB/s]  usage: 1957

My problem is that I can achieve 40MB/s with 785 CPU usage, but also with 1662 CPU usage. Where do those extra cycles go??

EDIT2: Thanks to Lirik and John Dibling, I now understand that the reason I find the timings above puzzling has nothing to do with I/O, but rather, with the way OpenMP implements critical sections. My intuition is that if you have 1 thread inside a CS and 10 threads waiting to get in, the moment the 1st thread exits the CS, the kernel should wake up one other thread and let it in. The timings suggest otherwise: can it be that the threads wake up many times on their own and find the CS occupied? Is this an issue with the threading library or with the kernel?

Matei David
  • getline is a pretty complicated function, and putting it inside an OpenMP critical section may not be the best way to deal with things. Have you considered having a single thread do all the getlines and put the results into a queue, and then the worker threads just pop from the queue inside the critical section? (Obviously this is not an answer to your actual question.) – abarnert Jun 06 '12 at 21:47
  • First, I'm willing to pay the price of C++ vs C, i.e., use `string` instead of `char[]`. Either way, `getline` or a C equivalent needs to run. Second, I/O wait and `getline` are not parallelizable. So I have 2 options: put them in a CS as above, or like you suggest, have 1 dedicated thread do all of those, enqueue the lines, and have the other threads dequeue lines. I don't see the benefit of one over the other. My goal is to have the process input bound _and_ use the least amount of CPU. – Matei David Jun 07 '12 at 16:36

2 Answers


"I want to read an input file (in C/C++) and process each line independently as fast as possible."

Reading from a file makes your application I/O bound, so the maximum performance you can achieve for the reading portion alone is to read at the maximum disk speed (on my machine that takes less than 10% CPU time). This means that even if you could completely free the reading thread from any processing, keeping up with the disk would require the line processing to fit within the remaining CPU time (about 90% on my machine). If the line-processing threads need more than that, you will not be able to keep up with the hard drive.
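To put some rough, made-up numbers on that: at 40MB/s input with lines averaging 100 bytes, the reader produces about 400,000 lines per second, i.e. one line every 2.5 microseconds. If processing a single line takes 10 microseconds, roughly four worker threads are enough to keep up with the disk; any extra threads will spend most of their time waiting.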

There are several options in that case:

  1. Queue up the input and let the processing threads dequeue "work" until they've caught up with the input that was presented (given that you have enough RAM to do so).
  2. Open enough threads and just max out your processor until all the data is read, which is your best effort scenario.
  3. Throttle reading/processing so that you don't take up all of the system resources (in case you're worried about UI responsiveness and/or user experience).

"...the processing takes a few ticks itself, so I decided to use OpenMP threads"

This is a good sign, but it also means that your CPU utilization will not be very high. This is the part where you can optimize your performance, and it's probably best to do it by hand, as John Dibling mentioned. In general, it's best if you queue up each line and let the processing threads pull work from the queue until there is nothing more to process. The latter is also known as the Producer/Consumer design pattern, a very common pattern in concurrent computing.
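For illustration, here is a minimal sketch of that pattern with a mutex-protected bounded queue (C++11 threads rather than OpenMP; process_line is your function, while run, MAX_QUEUE and the shutdown convention are placeholders I made up):

// Sketch only: one reader thread feeds a bounded queue, N workers drain it.
#include <condition_variable>
#include <fstream>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <vector>

void process_line(const std::string& line);    // assumed to exist elsewhere

void run(std::ifstream& f, int num_workers) {
  std::queue<std::string> q;
  std::mutex m;
  std::condition_variable not_empty, not_full;
  bool done = false;
  const std::size_t MAX_QUEUE = 1024;           // cap the queue so RAM stays bounded

  std::thread reader([&] {
    std::string line;
    while (std::getline(f, line)) {
      std::unique_lock<std::mutex> lk(m);
      not_full.wait(lk, [&] { return q.size() < MAX_QUEUE; });
      q.push(std::move(line));
      not_empty.notify_one();
    }
    std::lock_guard<std::mutex> lk(m);
    done = true;                                // no more input; wake everyone up
    not_empty.notify_all();
  });

  std::vector<std::thread> workers;
  for (int i = 0; i < num_workers; ++i)
    workers.emplace_back([&] {
      for (;;) {
        std::string line;
        {
          std::unique_lock<std::mutex> lk(m);
          not_empty.wait(lk, [&] { return !q.empty() || done; });
          if (q.empty()) return;                // input finished and queue drained
          line = std::move(q.front());
          q.pop();
          not_full.notify_one();
        }
        process_line(line);                     // done outside the lock
      }
    });

  reader.join();
  for (auto& w : workers) w.join();
}

The important property is that idle workers block on the condition variable instead of spinning, so CPU usage tracks the amount of actual work.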

Update

Why would there be a difference between

  • (i) each process: get lock, pull data, release lock, process data; and
  • (ii) one process: pull data, get lock, enqueue chunk, release lock; the others: get lock, dequeue chunk, release lock, process data?

There is very little difference: in a way, both represent a consumer/producer pattern. In the first case (i) you don't have an actual queue, but you could consider the file stream to be your Producer (queue) and the Consumer is the thread that reads from the stream. In the second case (ii) you're explicitly implementing the consumer/producer pattern, which is more robust, reusable and provides better abstraction for the Producer. If you ever decide to use more than one "input channel," then the latter case is better.

Finally (and probably most importantly), you can use a lock-free queue with a single producer and a single consumer, which will make (ii) considerably faster than (i) at getting you I/O bound. With a lock-free queue you can pull data, enqueue chunks and dequeue chunks without locking.
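As a rough sketch of what such a queue can look like with C++11 atomics (illustrative only, not a drop-in library; SpscQueue and its capacity are names I invented, and in practice you would probably reach for an existing implementation such as boost::lockfree::spsc_queue):

// Sketch of a single-producer/single-consumer lock-free ring buffer.
#include <array>
#include <atomic>
#include <cstddef>
#include <string>

template <std::size_t N>
class SpscQueue {
  std::array<std::string, N> buf_;
  std::atomic<std::size_t> head_{0};   // next slot the consumer reads
  std::atomic<std::size_t> tail_{0};   // next slot the producer writes
public:
  bool push(std::string v) {           // called by the single producer only
    std::size_t t = tail_.load(std::memory_order_relaxed);
    std::size_t next = (t + 1) % N;
    if (next == head_.load(std::memory_order_acquire))
      return false;                    // queue full
    buf_[t] = std::move(v);
    tail_.store(next, std::memory_order_release);
    return true;
  }
  bool pop(std::string& out) {         // called by the single consumer only
    std::size_t h = head_.load(std::memory_order_relaxed);
    if (h == tail_.load(std::memory_order_acquire))
      return false;                    // queue empty
    out = std::move(buf_[h]);
    head_.store((h + 1) % N, std::memory_order_release);
    return true;
  }
};

Note that push and pop return false instead of blocking, so the producer and consumer still need a strategy for the full/empty cases (sleep, yield, or fall back to a blocking queue).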

Kiril
  • Bravo for pointing out that I/O boundedness can't be fixed by throwing threads at the problem. – Kuba hasn't forgotten Monica Jun 07 '12 at 00:04
  • @KubaOber: You misunderstood what I'm trying to do. I _want_ the process to be input bound, that's the whole point of separating the data pulling from processing. But I want to do it with the least amount of CPU usage. – Matei David Jun 07 '12 at 20:05
  • @Lirik: Why would there be a difference between (i) each process: get lock, pull data, release lock, process data; and (ii) one process: pull data, get lock, enqueue chunk, release lock; the others: get lock, dequeue chunk, release lock, process data? Anyhow, pulling all data into RAM is not an option. And I don't want to max out the CPU, only to use just enough to make it completely input bound. – Matei David Jun 07 '12 at 20:11
  • @mmd see the update of my answer... btw, you can run out of RAM if processing cannot keep up with input and there is nothing you can do about that except throttle down the input. However, if that's not the case then you should be able to make the whole thing I/O bound. – Kiril Jun 07 '12 at 20:52
  • You are right, this would remove some of the waiting for locks. But if I use a lock-free queue, my original problem remains: how do I find the optimal number of consumers? Too few and the queue overflows; too many and they waste CPU cycles performing dequeues on an empty queue. My initial hope in throwing many threads at the problem was not to remove I/O boundedness, but that the scheduler would be smart enough to let them sleep when they have no work to do (and keep CPU usage low). The same problem remains with a lock-free queue. – Matei David Jun 07 '12 at 21:18
  • @mmd If the processing indeed takes just a few ticks, then that's probably faster than the number of ticks it would take to read the data (the optimal case for a lock-free queue, which only works with 1 producer and 1 consumer). That should mean that you can get I/O bound just by splitting the work between two threads. If a single consumer thread doesn't cut it, then you're going to need a blocking queue (which supports multiple consumers/producers); then all of the consumer threads will block until there is work queued up. Note: blocking doesn't waste CPU cycles, so you're good there too. – Kiril Jun 07 '12 at 21:31

The best you can hope to do is tune it yourself, by hand, through repeated measure-adjust-compare cycles.

The optimal number of threads to use to process a dataset is highly dependent on many factors, not the least of which are:

  1. The data itself
  2. The algorithm you use to process it
  3. The CPU(s) the threads are running on
  4. The operating system

You could try to design some kind of heuristic that measures the throughput of your processors and adjusts it on the fly, but this kind of thing tends to be way more trouble than it's worth.

As a rule, for tasks that are I/O bound I generally start with about 12 threads per core and tune from there. For tasks that are CPU bound, I'd start with about 4 threads per core and go from there. The key is the "going from there" part if you really want to optimize your processor use.

Also keep in mind that you should make this setting configurable if you really want to optimize, because every system on which this is deployed will have different characteristics.
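A minimal sketch of that, assuming C++11 and OpenMP (choose_num_threads is a hypothetical helper; the -N flag mirrors the one in the question's timing script):

// Pick a default thread count from the hardware, but let the user override it.
#include <omp.h>
#include <thread>

int choose_num_threads(int requested) {        // requested <= 0 means "auto"
  if (requested > 0)
    return requested;                          // explicit -N setting wins
  unsigned hw = std::thread::hardware_concurrency();
  if (hw == 0) hw = 4;                         // hardware_concurrency() may return 0
  return static_cast<int>(hw);                 // starting point; tune from here
}

// Typical use before the parallel region:
//   omp_set_num_threads(choose_num_threads(n_from_command_line));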

John Dibling
  • I guess my main concern is that the OpenMP threads seem to do busy waiting at the CS entrance; otherwise I can't explain the higher CPU usage at a fixed I/O rate. Or, if not completely 'busy', then at least 'inefficient'. I know one program, `pigz`, which seems to avoid this: it adapts its CPU usage to the I/O rate automatically. I guess my question really is how hard it is to achieve that. I'll post some timings to demonstrate... – Matei David Jun 07 '12 at 18:17