
Which way is better and faster to implement a critical section?

With a binary semaphore, between sem_wait and sem_post?
Or with atomic operations, like this:

#include <sched.h>
#include <stdbool.h>  /* for bool */

void critical_code(){
    static volatile bool lock = false;

    //Enter critical section
    while ( !__sync_bool_compare_and_swap (&lock, false, true ) ){
        sched_yield();
    }

    //...

    //Leave critical section
    lock = false;
}
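
For reference, the semaphore-based variant would look roughly like this (a minimal sketch; initialization and error checking omitted):

#include <semaphore.h>

/* Binary semaphore, initialized elsewhere with sem_init(&lock_sem, 0, 1) */
static sem_t lock_sem;

void critical_code_sem(){
    //Enter critical section (blocks on contention)
    sem_wait(&lock_sem);

    //...

    //Leave critical section
    sem_post(&lock_sem);
}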
Squall

4 Answers


Spin-locks perform better if there is little contention for the lock and/or it is never held for a long period of time. Otherwise you are better off with a lock that blocks rather than spins. There are of course hybrid locks which will spin a few times, and if the lock cannot be acquired, then they will block.

Which is better for you depends on your application. Only you can answer that question.
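
A hybrid lock can be approximated in user space along these lines (a rough sketch; hybrid_lock and SPIN_LIMIT are made-up names, and on glibc an adaptive mutex type such as PTHREAD_MUTEX_ADAPTIVE_NP does something similar internally):

#include <pthread.h>

#define SPIN_LIMIT 1000   /* arbitrary tuning parameter */

/* Spin briefly on trylock, then fall back to a blocking lock. */
void hybrid_lock(pthread_mutex_t *m){
    for (int i = 0; i < SPIN_LIMIT; i++){
        if (pthread_mutex_trylock(m) == 0)
            return;               /* acquired the lock while spinning */
    }
    pthread_mutex_lock(m);        /* give up spinning and block */
}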

David Heffernan
  • The critical section is not small enough and it may do I/O operations. How do I use hybrid locks? – Squall Jan 30 '12 at 21:56
  • If any IO may be performed while the lock is held, the only type of lock that makes sense is one that sleeps on contention. Spinlocks are a horrible approach whenever the duration of the lock is more than a few cycles. For example, `spin_lock; counter++; spin_unlock;` makes sense. `spin_lock; write(foo); spin_unlock;` does not. – R.. GitHub STOP HELPING ICE Jan 31 '12 at 07:38

Regardless of which method you choose, the worst performance problem with your code has nothing to do with the type of lock; it is the fact that you're locking code rather than data.

With that said, there is no reason to roll your own spinlocks like that. Either use pthread_spin_lock if you want a spinlock, or pthread_mutex_lock or sem_wait (with a binary semaphore) if you want a lock that can yield the CPU to other tasks when contended. The code you have written is the worst of both worlds because of how it uses sched_yield. The call to sched_yield will ensure that the lock waits at least a few milliseconds (and probably a whole scheduling timeslice) in the case where there's both lock contention and CPU load, and it will burn 100% CPU when there's contention but no CPU load (due to the lock-holder being blocked in IO, for instance). If you want any of the benefits of a spinlock, you need to spin without making any syscalls. If you want any of the benefits of yielding the CPU, you should use a proper synchronization primitive which will use (on Linux) futex (or equivalent) operations to sleep exactly until the lock is available - no shorter and no longer.
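
As a sketch of what "lock the data, not the code" looks like with a plain pthread mutex (shared_counter and counter_add are made-up names for illustration):

#include <pthread.h>

/* Each piece of shared data carries its own lock. */
struct shared_counter {
    pthread_mutex_t lock;   /* e.g. initialized with PTHREAD_MUTEX_INITIALIZER */
    long value;
};

void counter_add(struct shared_counter *c, long n){
    pthread_mutex_lock(&c->lock);   /* sleeps (via futex on Linux) if contended */
    c->value += n;                  /* only the access to the shared data is protected */
    pthread_mutex_unlock(&c->lock);
}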

And if by chance all that went over your head, don't even think about writing your own locks.

R.. GitHub STOP HELPING ICE

You didn't look deep enough into the gcc documentation. The correct builtins for this type of lock are __sync_lock_test_and_set and __sync_lock_release. These have exactly the guarantees that you need for such a thing. In terms of the new C11 standard, this would be the type atomic_flag with operations atomic_flag_test_and_set and atomic_flag_clear.
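
Rewritten with those builtins, the lock from the question would look roughly like this (a sketch; the spin loop deliberately makes no syscalls):

void critical_code(){
    static volatile int lock = 0;

    //Enter critical section: spin until the previous value was 0
    while (__sync_lock_test_and_set(&lock, 1))
        ;

    //...

    //Leave critical section: store 0 with release semantics
    __sync_lock_release(&lock);
}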

As R.. already indicates, putting sched_yield into the loop is really a bad idea.

If the code inside the critical section takes only a few cycles, the probability that its execution falls across the boundary of a scheduling slice is small. The number of threads blocked spinning actively will be at most the number of processors minus one. None of this holds if you yield execution as soon as you fail to obtain the lock immediately. If you have real contention on your lock and yield, you will get a multitude of context switches, which will bring your system almost to a halt.
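
For completeness, the same lock expressed with the C11 primitives mentioned above would look roughly like this (a sketch, assuming a compiler that provides <stdatomic.h>):

#include <stdatomic.h>

static atomic_flag lock = ATOMIC_FLAG_INIT;

void critical_code_c11(void){
    //Enter critical section: spin until the flag was previously clear
    while (atomic_flag_test_and_set(&lock))
        ;

    //...

    //Leave critical section
    atomic_flag_clear(&lock);
}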

Jens Gustedt
  • I don't see any compelling reason to use those primitives instead of compare-and-swap. They might be more efficient on a few really odd machines (nothing modern), but I'd rather ensure identical lock semantics across all archs and be able to do useful things like storing an owner id or other information in the atomic lock field itself. – R.. GitHub STOP HELPING ICE Jan 31 '12 at 02:03
  • The reason to use `atomic_flag` is simple. It is the only atomic type in C11 that is guaranteed to be lock-free. Thus it is the only such type that is guaranteed to be signal safe and not to block other threads when descheduled in the middle of operation. – Jens Gustedt Jan 31 '12 at 06:58
  • Well now you've brought in C11 standard atomics instead of GCC ones. As far as I know, all the GCC ones are guaranteed to actually be atomic, as they operate on objects of the unmodified type, not `_Atomic int` etc. which might be larger than the corresponding non-atomic types to admit an additional lock field. In any case, I would consider this a major quality of implementation issue that can be largely ignored. – R.. GitHub STOP HELPING ICE Jan 31 '12 at 07:19
  • There will never be a C11 implementation (or at least not a respectable one) where "atomics" actually have lock fields, because the way to emulate real atomics on CPUs that lack them is much more efficient; see the way atomic compare-and-swap is implemented on Linux ARM with the kernel helper. (This approach does not work for SMP, but any CPU designed for SMP usage will have a full set of working atomics.) – R.. GitHub STOP HELPING ICE Jan 31 '12 at 07:21
  • @R.. I think you are just wrong. In C11 `_Atomic` is a qualifier that applies to any data type. So an implementation must provide atomic versions of any data type. I don't see how you can do that without a lock field. And for ARM, as you give the example, this starts with implementing a lock-free version of `uint64_t`. Not all versions of that processor have a native instruction for that. And if you have to use a lock field, better use one with the correct memory ordering semantics. – Jens Gustedt Jan 31 '12 at 07:32
  • Indeed, I meant for system wordsize types and smaller, although really on 32-bit machines 64-bit types should be done the same as the ARM method... You're certainly right about structs and larger types in general, though. – R.. GitHub STOP HELPING ICE Jan 31 '12 at 07:35

As others have pointed out, it's not really about how fast the locking code itself is. Once a locked sequence is initiated with an instruction such as "xchg reg,mem", a lock signal is sent down through the caches and out to the devices on all buses. Only when the last device has acknowledged that it will hold off - which may take hundreds if not a thousand clock cycles - is the actual exchange performed. If your slowest device is a classic PCI card, it has a bus speed of 33 MHz, roughly one hundredth of the CPU's internal clock, and the PCI device (if active) will need several clock cycles (at 33 MHz) to respond. During that time the CPU will be waiting for the acknowledgement to come back.

Most spinlocks are probably used in device drivers where the routine won't be pre-empted by the OS but might be interrupted by a higher-level driver.

A critical section is really just a spin-lock, but one that interfaces with the OS because the thread holding it may be pre-empted.

Olof Forshell