C code of lock-free queue

Question

How could I implement this lock-free queue pseudocode in C?

ENQUEUE(x)
    q ← new record
    q^.value ← x
    q^.next ← NULL
    repeat
        p ← tail
        succ ← COMPARE&SWAP(p^.next, NULL, q)
        if succ ≠ TRUE
            COMPARE&SWAP(tail, p, p^.next)
    until succ = TRUE
    COMPARE&SWAP(tail,p,q)
end

DEQUEUE()
    repeat
        p ← head
        if p^.next = NULL
            error queue empty
    until COMPARE&SWAP(head, p, p^.next)
    return p^.next^.value
end

How would be using the Built-in functions for atomic memory access

__sync_bool_compare_and_swap (type *ptr, type oldval type newval, ...)

I currently have

typedef struct queueelem {
    queuedata_t data;
    struct queueelem *next;
} queueelem_t;

typedef struct queue {
    int capacity;
    int size;
    queueelem_t *head;
    queueelem_t *tail;
} queue_t;

queue_t *
queue_init(int capacity)
{
    queue_t *q = (queue_t *) malloc(sizeof(queue_t));
    q->head = q->tail = NULL;
    q->size = 0;
    q->capacity = capacity;
    return q;
}

The only "convenience" of using global variables is that it lets you be lazy and avoid some typing. Since you're doing concurrency programming, you should be ashamed that you even asked that question. :-) — R.. GitHub STOP HELPING ICE, May 23 '11 at 02:37
It would help if you first implement a complete non-threadsafe variant first and try to make it thread safe later. As for global, read the comment above. — deepsnore, May 23 '11 at 02:41
How is this different (e.g. not a duplicate) of the last question you asked? — abelenky, May 23 '11 at 05:02
Note that names ending in `_t` are reserved in some standards (POSIX, not sure about the actual standard) and shouldn't be used in your code. You should use a different naming convention. — Chris Lutz, May 23 '11 at 06:07
possible duplicate of [Lock-free queue](http://stackoverflow.com/questions/6089029/lock-free-queue) — Jens Gustedt, May 23 '11 at 06:24
some rules to live by. 1) Always add/delete from the HEAD of the list. Faster! Simpler! A single CAS will add/remove items from list, and you eliminate a ton of race conditions. If you have to do more than one CAS, you will have to deal with race condtions. Changing the next ptr on the tail item and then changing the head or the tail at the same time is not possible with a single CAS. changing the head pointer of the list is! 2) only worry about ABA if you are removing an item and putting it back. If you add new ptrs to the list only, you wont have ABA. — johnnycrash, Jun 15 '11 at 20:11

score 15 · Accepted Answer · edited Apr 05 '21 at 22:21

15

https://www.liblfds.org/

Public domain, no license, portable implementation of lock-free algorithms in C.

Builds out of the box for Windows and Linux.

Uses GCC on Linux, so uses the intrinsics (well, apart from 128 bit CAS, there's no intrinsic - uses inline assembly for that).

Contains the M&S queue. Have a look at the source code and see how it's done.

edited Apr 05 '21 at 22:21

Levi Morrison

19,116
7
65
85

answered May 26 '11 at 15:01

Looks like the site is down – Schneems Jun 13 '17 at 18:37

score 4 · Answer 2 · edited May 23 '17 at 10:29

If your goal is production code, simply don't do that; use locks.

In your previous question, you have got enough information explaining why. Correct lock-free implementations of even simple data structures such as queue and stack in the absence of garbage collector are tricky and sophisticated due to the (in)famous ABA problem. Unfortunately some research papers do not take ABA into account for whatever reasons; your pseudo-code seems taken from one of such papers. If you translate it to C and use heap allocated memory for nodes, it will cause undeterministic bugs if used in real code.

If you are doing this stuff to gain experience, then don't expect SO fellows to solve it for you. You have to read all the cited materials and much more, make sure you really understand all nuances of lock-free algorithms such as ABA, study various techniques intended to address the issue, study existing lock-free implementations, etc.

Finally, little guidance for translating the given pseudo-code into C:

q^.value ← x means q_elem->data = x;
repeat ... until COMPARE&SWAP(head, p, p^.next) is equivalent to do {...} while (!__sync_bool_compare_and_swap(q_obj->head, q_elem, q_elem->next);

where q_obj is an instance of type queue_t (i.e. a queue) and q_elem is an instance of type queueelem_t (i.e. a queue node).

So what do I do if I need more performance in production? IMHO Locks are overkill for almost every race condition. — johnnycrash, Jun 15 '11 at 20:07
If I can get a lock free queue I'd prefer getting that, especially in production code. — Clearer, Jul 05 '17 at 08:54
You will need to go lock-free if you have a hard realtime requirement. It depends on the application. — E.M., Sep 01 '22 at 00:55

score 0 · Answer 3 · answered May 23 '11 at 05:03

While not exactly C, check out the proposed Boost.Lockfree library. The internals are pretty easy to grok and could be ported to C, or, conversely, you could wrap Boost.Lockfree in a C API and use that.

Similarly, Boostcon 2010 had lots of discussion about lockfree programming and STM which is worth looking at if you're interested in this subject. I can't find a link to the videos, but the talks from Intel, IBM and AMD were worth watching since they're dealing with STM at the CPU level.

score 0 · Answer 4 · answered May 23 '11 at 05:30

0

It sounds like what you want is called an MCS queue lock (although deceptively named, it's really lock-free, just not wait-free), and there is some good pseudo-code available here: http://www.cs.rochester.edu/research/synchronization/pseudocode/ss.html#mcs

answered May 23 '11 at 05:30

mattst88

1,462
13
21

It's not deceptively named because it implements a lock, i.e. a mutual exclusion algorithm. – Alexey Kukanov May 23 '11 at 06:43

Darkautism Nong · Answer 5 · 2018-09-04T06:41:57.110

-2

I use C to write a minimize lockfree queue implementation.

lfq.

It support many producer, many consumer.

edited Sep 04 '18 at 06:41

answered Feb 22 '17 at 01:17

Darkautism Nong

1
4

You don't need `asm volatile( "lfence" )` for a load barrier. On x86, `asm("" ::: "memory")` is sufficient. You seem to be confusing C++'s compile-time memory model with x86's runtime memory model. See [When should I use \_mm\_sfence \_mm\_lfence and \_mm\_mfence](https://stackoverflow.com/a/50780314). You really don't need any hand-rolled asm stuff or `volatile`, just use C11 `stdatomic`. – Peter Cordes Sep 04 '18 at 06:56
Also your writer and reader contend with each other for `ctx->count`. In `lfq_enqueue`, shouldn't `p->next = insert_node;` be inside the CAS loop or something? Your current CAS loop in that function could be optimized to an unconditional `ctx->tail = insert_node`, so it's probably not doing what you intend. The `mb()` is also unnecessary; you've only modified private objects at that point. – Peter Cordes Sep 04 '18 at 07:08
Anyway, writing your own lock-free queue is a good way to learn stuff, but yours doesn't look ready for other people to use it yet. (And normally you'd use a fixed-size circular buffer to avoid allocate/free on every queue operation. On some implementations, there's a single global free pool, so calloc/free contend with each other for access to the same data structure!) – Peter Cordes Sep 04 '18 at 07:25
How to use ctx-> tail = insert_node to prevent ABA problem? Could you show me any sample code? – Darkautism Nong Sep 05 '18 at 08:00
The mb() is line 120? If you comment out line 120, the writer will get insert_node befor you assigned insert_node into ctx->tail. I cannot test pass 1 hour test if i comment out line 120.I'm glad someone can help me to improve knowledge of lock-free structure. But i cannot remove these memory barrier without crash. – Darkautism Nong Sep 05 '18 at 08:08
This is my speed test: lmb() version real 0m12.473s user 1m24.386s sys 0m5.008s mb() version real 0m15.719s user 1m25.282s sys 0m3.778s mb version always spend more time. Could you tell me the reason why don't use lfence? – Darkautism Nong Sep 05 '18 at 08:19
You don't need `lfence` because you aren't doing `movntdqa` loads from video RAM (or other WC memory region). That's basically the only use memory-ordering use-case. I already linked [When should I use \_mm\_sfence \_mm\_lfence and \_mm\_mfence](https://stackoverflow.com/a/50780314) which explains this, and what you *do* need for an acquire-load in C vs. x86 asm. I don't see why you're using a load barrier in `inHP()` anyway. I could see maybe once outside the loop to make it an acquire, if you're rolling your own atomics with `volatile`. – Peter Cordes Sep 05 '18 at 08:31
`mfence` is slower than `lfence`, so of course it's slower with `mb()`. But you sometimes *do* need `mfence` for correctness (or better an `xchg` store). If your `mb()` is avoiding bugs, it's maybe from that function inlining into the caller and optimizing, not from runtime memory ordering with it in that position. None of the assignments before `mb()` in `lfq_enqueue` are to objects that other threads can *currently* read. So you just need to make sure you have release semantics on the store the publishes. See also https://en.wikipedia.org/wiki/Non-blocking_linked_list – Peter Cordes Sep 05 '18 at 08:46
I got crash if i comment out line lfq.c:8 lmb() + remove lfq.h:16 volatile. Even though you said it makes sense, but it's not work. – Darkautism Nong Sep 05 '18 at 08:47
I think your CAS in `enqueue` isn't actually useless; you're using it to do an atomic exchange. You're loading, then CAS with that value to make sure you have the correct old value. It would be much more efficient to just exchange directly to update `ctx->tail` to point at your new node, with an `xchg` instruction (there's a `__sync` builtin for that, too). But yeah, I think it is actually that simple to append to the end of a linked list, and it's ok to only update the `old_tail->next` pointer after updating `tail`. – Peter Cordes Sep 05 '18 at 08:51
If you get crashes from removing barriers, it's probably because something somewhere else isn't actually safe; maybe in `dequeue`. I haven't looked at it; it looks complicated and weird. IDK why you don't just `CAS(&ctx->head, old_head, next)` to atomically advance `head` and "claim" the old head for this thread, after checking inside the retry loop that `next` is non-null. CAS failure indicates that another reader won the race for that node. – Peter Cordes Sep 05 '18 at 08:57
You said old commit? https://github.com/darkautism/lfqueue/blob/a4bb342349c5b1184c00aa6c18c8759ccd5cfa47/lfq.c#L53 Sorry, this version code not really lock free queue. Because you would swap empty node out and must swap in at line 56. This version code actually is a mini spin lock. So `CAS(&ctx->head, old_head, next)` is not really lock-free queue. – Darkautism Nong Sep 05 '18 at 09:03
This code use CAS zero to prevent swap out empty struct. If you just use `CAS(&ctx->head, old_head, next)`, you will swap out empty struct. – Darkautism Nong Sep 05 '18 at 09:37

C code of lock-free queue

5 Answers5