I want to read an input file (in C/C++) and process each line independently as fast as possible. The processing takes a few ticks itself, so I decided to use OpenMP threads. I have this code:
#pragma omp parallel num_threads(num_threads)
{
    string line;
    bool got_line;
    while (true) {
        #pragma omp critical(input)
        {
            // Read and test the stream while still holding the lock, so the
            // success check does not race against getline calls from other threads.
            got_line = static_cast<bool>(getline(f, line));
        }
        if (!got_line)
            break;
        process_line(line);
    }
}
My question is: how do I determine the optimal number of threads to use? Ideally, I would like this to be detected dynamically at runtime. I don't understand the schedule(dynamic) option for parallel for loops, so I can't say whether that would help. Any insights?
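For reference, this is the kind of runtime detection I have in mind (a minimal sketch; omp_get_num_procs() is a standard OpenMP call, and the cap of 8 is an arbitrary placeholder, not a measured optimum):

#include <omp.h>
#include <algorithm>

// Choose a thread count at runtime instead of hard-coding it.
// omp_get_num_procs() reports the number of processors available to the
// program; the upper bound of 8 is only a placeholder.
int choose_num_threads()
{
    return std::min(omp_get_num_procs(), 8);
}

I would then pass the result to num_threads(), but I still don't know how to pick a good cap.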
Also, I'm not sure how to determine the optimal number "by hand". I tried various numbers for my specific application. I would have thought the CPU usage reported by top would help, but it doesn't(!). In my case, the CPU usage stays consistently at around num_threads*(85-95)%. However, using pv to observe the speed at which I'm reading the input, I noted that the optimal number is around 2-5; above that, the input speed becomes smaller. So my question is: why would I then see a CPU usage of 850% when using 10 threads? Can this be due to some inefficiency in how OpenMP handles threads waiting to get into the critical section?
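To try to see where the time goes, I am considering timing the wait for the critical section explicitly, roughly like this (a sketch only, not the code that produced the numbers below; total_wait and the final printf are additions of mine):

#include <omp.h>
#include <cstdio>

double total_wait = 0.0;   // summed over all threads at the end of the region

#pragma omp parallel num_threads(num_threads) reduction(+:total_wait)
{
    string line;
    bool got_line;
    while (true) {
        double t0 = omp_get_wtime();               // before trying to enter the CS
        #pragma omp critical(input)
        {
            total_wait += omp_get_wtime() - t0;    // time spent blocked on entry
            got_line = static_cast<bool>(getline(f, line));
        }
        if (!got_line)
            break;
        process_line(line);
    }
}
printf("time spent waiting for the critical section: %g s\n", total_wait);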
EDIT: Here are some timings. I obtained them with:
for NCPU in $(seq 1 20) ; do echo "NCPU=$NCPU" ; { pv -f -a my_input.gz | pigz -d -p 20 | { { sleep 60 ; PID=$(ps gx -o pid,comm | grep my_prog | sed "s/^ *//" | cut -d " " -f 1) ; USAGE=$(ps h -o "%cpu" $PID) ; kill -9 $PID ; sleep 1 ; echo "usage: $USAGE" >&2 ; } & cat ; } | ./my_prog -N $NCPU >/dev/null 2>/dev/null ; sleep 2 ; } 2>&1 | grep -v Killed ; done
NCPU=1 [8.27MB/s] usage: 98.4
NCPU=2 [12.5MB/s] usage: 196
NCPU=3 [18.4MB/s] usage: 294
NCPU=4 [23.6MB/s] usage: 393
NCPU=5 [28.9MB/s] usage: 491
NCPU=6 [33.7MB/s] usage: 589
NCPU=7 [37.4MB/s] usage: 688
NCPU=8 [40.3MB/s] usage: 785
NCPU=9 [41.9MB/s] usage: 884
NCPU=10 [41.3MB/s] usage: 979
NCPU=11 [41.5MB/s] usage: 1077
NCPU=12 [42.5MB/s] usage: 1176
NCPU=13 [41.6MB/s] usage: 1272
NCPU=14 [42.6MB/s] usage: 1370
NCPU=15 [41.8MB/s] usage: 1493
NCPU=16 [40.7MB/s] usage: 1593
NCPU=17 [40.8MB/s] usage: 1662
NCPU=18 [39.3MB/s] usage: 1763
NCPU=19 [38.9MB/s] usage: 1857
NCPU=20 [37.7MB/s] usage: 1957
My problem is that I can achieve 40MB/s at 785% CPU usage, but also at 1662% CPU usage. Where do those extra cycles go?
EDIT2: Thanks to Lirik and John Dibling, I now understand that the reason I find the timings above puzzling has nothing to do with I/O, but rather with the way OpenMP implements critical sections. My intuition is that if you have 1 thread inside a critical section and 10 threads waiting to get in, then the moment the first thread exits, the kernel should wake up exactly one other thread and let it in. The timings suggest otherwise: can it be that the waiting threads wake up many times on their own, only to find the critical section still occupied? Is this an issue with the threading library or with the kernel?
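One experiment I may try, to check whether the per-line critical section itself is what burns the extra cycles, is to grab a whole batch of lines per entry so that each thread contends far less often. A rough, untested sketch (the batch size of 64 is arbitrary):

#include <vector>

const size_t BATCH = 64;   // arbitrary; anything that amortizes the lock should do

#pragma omp parallel num_threads(num_threads)
{
    vector<string> batch;
    batch.reserve(BATCH);
    while (true) {
        batch.clear();
        #pragma omp critical(input)
        {
            // Read up to BATCH lines per critical-section entry, so the lock
            // is taken roughly BATCH times less often than in the per-line version.
            string line;
            while (batch.size() < BATCH && getline(f, line))
                batch.push_back(line);
        }
        if (batch.empty())
            break;
        for (const string &l : batch)
            process_line(l);
    }
}

If the CPU overhead disappears with this version, that would point at contention on the critical section rather than at I/O.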