
I'm trying to instrument some functions in my application to see how long they take. I'm recording all of the times in-memory using a linked list.

In the process, I've introduced a global variable that keeps track of the end of the list. When I enter a new timing region, I insert a new record at the end of the list. Fairly simple stuff.

Some of the functions I want to track are called in OpenMP regions, however. Which means they'll likely be called multiple times in parallel. And this is where I'm stumped.

If this was using normal Pthreads, I'd simply wrap access to the global variable in a mutex and call it a day. However, I'm unsure: will this strategy still work with functions called in an OpenMP region? As in, will they respect the lock?

For example (won't compile, but I think gets the point across):

Record *head;
Record *tail;

Record *start_timing(char *name) {
    Record *r = create_record(name);
    tail->next_record = r;
    tail = r;
    return r;
}

void foo(void) {
   Record *r = start_timing("foo");
   //Do something...
   stop_timing(r);
}

int main(void) {
   Record *r = start_timing("main");
   //Do something...
   #pragma omp parallel for...
   for (int i = 0; i < 42; i++) {
       foo();
   }
   //Do some more...
   stop_timing(r);
}

Which I would then update to:

Record *start_timing(char *name) {
    Record *r = create_record(name);

    acquire_mutex_on_tail();
    tail->next_record = r;
    tail = r;
    release_mutex_on_tail();

    return r;
}

(Apologies if this has an obvious answer - I'm relatively inexperienced with the OpenMP framework and with multithreading in general.)

tonysdg

1 Answer


The idiomatic mutex solution is to use OpenMP locks:

omp_set_lock(&taillock);
tail->next_record = r;
tail = r;
omp_unset_lock(&taillock);

and somewhere:

omp_lock_t taillock;
omp_init_lock(&taillock);

...

omp_destroy_lock(&taillock);
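
Put together, a sketch of the lock-based version might look like this (reusing the question's Record, create_record and tail; where exactly you initialize and destroy the lock is up to you):

#include <omp.h>

omp_lock_t taillock;

Record *start_timing(char *name) {
    Record *r = create_record(name);

    omp_set_lock(&taillock);      /* serialize updates to the shared tail */
    tail->next_record = r;
    tail = r;
    omp_unset_lock(&taillock);

    return r;
}

int main(void) {
    omp_init_lock(&taillock);     /* before the first parallel region */
    /* ... instrumented code ... */
    omp_destroy_lock(&taillock);  /* after the last measurement */
}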

The simple OpenMP solution:

Record *start_timing(char *name) {
    Record *r = create_record(name);
    #pragma omp critical
    {
        tail->next_record = r;
        tail = r;
    }
    return r;
}

That uses an implicit global lock shared by all unnamed critical sections in the program; give the section a name (e.g. #pragma omp critical(timing)) if you want a separate lock. For some detailed discussions see the answers to this question.

For practical purposes, using Pthread locks will also work, at least for scenarios where OpenMP is based on Pthreads.
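
As a sketch (again borrowing the question's declarations), the Pthreads variant differs only in the lock API:

#include <pthread.h>

static pthread_mutex_t tail_mutex = PTHREAD_MUTEX_INITIALIZER;

Record *start_timing(char *name) {
    Record *r = create_record(name);

    /* excludes OpenMP worker threads too, as long as they are Pthreads underneath */
    pthread_mutex_lock(&tail_mutex);
    tail->next_record = r;
    tail = r;
    pthread_mutex_unlock(&tail_mutex);

    return r;
}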

A word of warning

Using locks in performance measurement code is dangerous. So is memory allocation, which often implies locking as well. This means that start_timing has a significant cost, and performance only gets worse with more threads. And that doesn't even account for the cache invalidation caused by one thread allocating a chunk of memory (the record) and another thread then modifying it (the tail pointer).

Now that may be fine if the sections you measure take seconds, but it will cause great overhead and perturbation when your sections are only hundreds of cycles.

To create a scalable performance tracing facility, you must pre-allocate thread-local memory in larger chunks and have each thread write only to its local part.
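
A minimal sketch of that idea, under a few simplifying assumptions (fixed per-thread capacity, no nested parallelism, no overflow handling); the names Event, ThreadTrace and tracing_init are made up for illustration:

#include <omp.h>
#include <stdlib.h>

#define EVENTS_PER_THREAD 100000      /* pre-allocated capacity, tune as needed */

typedef struct {
    const char *name;
    double      begin;
    double      end;
} Event;

typedef struct {
    Event  *events;                   /* chunk owned exclusively by one thread */
    size_t  count;
} ThreadTrace;

static ThreadTrace *traces;

void tracing_init(void) {
    int n = omp_get_max_threads();
    traces = malloc(n * sizeof *traces);
    for (int i = 0; i < n; i++) {
        traces[i].events = malloc(EVENTS_PER_THREAD * sizeof(Event));
        traces[i].count  = 0;
    }
}

/* Each thread appends only to its own chunk: no locks, no shared writes. */
Event *start_timing(const char *name) {
    ThreadTrace *t = &traces[omp_get_thread_num()];
    Event *e = &t->events[t->count++];    /* assumes count stays below EVENTS_PER_THREAD */
    e->name  = name;
    e->begin = omp_get_wtime();
    return e;
}

void stop_timing(Event *e) {
    e->end = omp_get_wtime();
}

Padding each ThreadTrace to a cache line would further reduce false sharing between the per-thread counters.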

You can also choose to use one of the existing measurement infrastructures, such as Score-P.

Overhead & perturbation

First, distinguish between the two (closely related) concepts: overhead is extra time you spend, while perturbation is the impact on what you measure (i.e. you now measure something different from what happens without the measurement). Overhead is undesirable in large quantities, but perturbation is much worse.

Yes, you can avoid some of the perturbation by pausing the timer during your expensive measurement runtime (the overhead remains). However, in a multi-threaded context this is still very problematic.

  • Slowing down progress in one thread may lead to other threads waiting for it, e.g. at an implicit barrier. How do you attribute the waiting time of that thread, and of the others that wait on it transitively?
  • Memory allocation usually involves locking - so if you allocate memory during the measurement, you will slow down other threads that also need to allocate. You could try to mitigate this with memory pools, but I'd avoid the linked list in the first place.
Zulan
  • Thanks for the Score-P link; I'll have to investigate it more thoroughly! Question: I should have mentioned that I'm currently "pausing" timing of the outer function before entering the inner function, and then resuming timing on returning from the inner function. Thus, the lock and the malloc shouldn't be captured in my timing information. If I don't begin timing until after the lock and memory allocation (as in, I malloc the record, lock and insert it into place, and then begin timing), would I still have that performance overhead issue? – tonysdg Dec 04 '16 at 17:32
  • Please see my added last paragraph. – Zulan Dec 04 '16 at 21:41