10

Is there any C++ implementation (source code) of the "optimistic approach to lock-free FIFO queues" algorithm?

peterh
uray

5 Answers

11

Herb Sutter covered just such a queue as part of his Effective Concurrency column in Dr. Dobb's Journal.

Writing Lock-Free Code: A Corrected Queue

greyfade
  • That is theory; what I'm asking about is the implementation. Is there any source code or library that implements the algorithm? – uray May 31 '10 at 19:34
  • 4
    Did you even read the article? *All of page 2* is annotated source code. – greyfade May 31 '10 at 19:53
  • OK, sorry about that; I was expecting it to be wrapped as a library or something, so I could just include the source in my project and use it. Is there any benchmark of this algorithm compared to the paper I refer to above? – uray May 31 '10 at 20:28
  • Compared *to*, no. But there are benchmarks in his follow-up articles, linked from his website: EC #16 through EC #18. – greyfade May 31 '10 at 21:20
  • Do you have an implementation of `atomic<>`? – uray May 31 '10 at 22:17
  • No, that's something that's in C++0x. You can find libraries that have a similar type, or you can roll your own. Setting an `atomic<>` is a simple matter of a compare-and-swap operation, which is often provided as a compiler intrinsic. – greyfade Jun 01 '10 at 00:09
  • This question has some discussion on writing `atomic<>` and similar functions: http://stackoverflow.com/questions/1158374/portable-compare-and-swap-atomic-operations-c-c-library – greyfade Jun 01 '10 at 00:13
  • Thanks, I ended up writing my own `atomic<>`. I have one more question about CACHE_LINE_SIZE in my new post below. – uray Jun 01 '10 at 00:27
  • 1
    Unfortunately, page 2 cannot be accessed due to some problem on the Dr. Dobb's web page. – Mathieu Dutour Sikiric Mar 22 '21 at 11:38
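To illustrate greyfade's point about `atomic<>` being a simple matter of an atomic exchange/compare-and-swap: here is a minimal sketch of such a spin flag using C++11 `<atomic>` (the class and method names here are illustrative, not from Sutter's article):

```cpp
#include <atomic>

// Minimal sketch of an atomically settable flag. set() returns the
// previous value, so `while (f.set(true)) {}` spins until this thread
// is the one that observed the false -> true transition.
class SpinFlag {
    std::atomic<bool> flag{false};
public:
    bool set(bool v) { return flag.exchange(v, std::memory_order_acquire); }
    void clear()     { flag.store(false, std::memory_order_release); }
};
```

On compilers without C++11 support, the same effect can be had from compiler intrinsics such as GCC's `__sync_lock_test_and_set` or Windows' `InterlockedExchange`.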
4

To build on the answer given by greyfade, which is based on http://www.drdobbs.com/high-performance-computing/212201163 (the last part of the article), the optimized code would be (with some modifications to suit my naming and coding conventions):

template <typename T> class LFQueue {
private:
    struct LFQNode {
        LFQNode( T* val ) : value(val), next(nullptr) { }
        T* value;
        AtomicPtr<LFQNode> next;
        char pad[CACHE_LINE_SIZE - sizeof(T*) - sizeof(AtomicPtr<LFQNode>)];
    };

    char pad0[CACHE_LINE_SIZE];
    LFQNode* first;                 // for one consumer at a time
    char pad1[CACHE_LINE_SIZE - sizeof(LFQNode*)];
    InterlockedFlag consumerLock;   // shared among consumers
    char pad2[CACHE_LINE_SIZE - sizeof(InterlockedFlag)];
    LFQNode* last;                  // for one producer at a time
    char pad3[CACHE_LINE_SIZE - sizeof(LFQNode*)];
    InterlockedFlag producerLock;   // shared among producers
    char pad4[CACHE_LINE_SIZE - sizeof(InterlockedFlag)];
public:
    LFQueue() {
        first = last = new LFQNode( nullptr ); // dummy divider node
        producerLock = consumerLock = false;
    }

    ~LFQueue() {
        while( first != nullptr ) {
            LFQNode* tmp = first;
            first = tmp->next;
            delete tmp;
        }
    }

    bool pop( T& result ) {
        while( consumerLock.set(true) ) 
        { }                             // acquire exclusivity
        if( first->next != nullptr ) {  // if queue is nonempty 
            LFQNode* oldFirst = first;
            first = first->next;
            T* value = first->value;    // take it out
            first->value = nullptr;     // of the Node
            consumerLock = false;       // release exclusivity
            result = *value;            // now copy it back
            delete value;               // and clean up
            delete oldFirst;            // both allocations
            return true;                // and report success
        }
        consumerLock = false;           // release exclusivity
        return false;                   // queue was empty
    }

    bool push( const T& t )  {
        LFQNode* tmp = new LFQNode( new T(t) ); // do work off to the side
        while( producerLock.set(true) ) 
        { }                             // acquire exclusivity
        last->next = tmp;               // A: publish the new item
        last = tmp;                     // B: not "last->next"
        producerLock = false;           // release exclusivity
        return true;
    }
};

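The code above assumes helper types `InterlockedFlag` and `AtomicPtr` with the behavior used in `pop`/`push`. A minimal C++11 sketch of what those could look like (my own guess at the intended semantics, not from the article):

```cpp
#include <atomic>

// Flag whose set() returns the previous value, enabling the
// `while (lock.set(true)) {}` spin-acquire pattern used in the queue.
class InterlockedFlag {
    std::atomic<bool> f;
public:
    InterlockedFlag() : f(false) {}
    bool set(bool v) { return f.exchange(v); }          // returns old value
    InterlockedFlag& operator=(bool v) { f.store(v); return *this; }
};

// Pointer with atomic load/store, convertible to/assignable from T*,
// matching how the queue reads and writes the `next` links.
template <typename T>
class AtomicPtr {
    std::atomic<T*> p;
public:
    AtomicPtr(T* v = nullptr) : p(v) {}
    operator T*() const { return p.load(); }
    AtomicPtr& operator=(T* v) { p.store(v); return *this; }
};
```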

Another question: how do you define CACHE_LINE_SIZE? It varies between CPUs, right?

uray
  • A good number to choose would be `64` bytes, I think. But you'll probably want to balance it with size, so I'd suggest looking at your target CPUs and choose a size that fits the most common ones you expect to target. – greyfade Jun 01 '10 at 01:54
  • 1
    Just a quick note: this is not a forum, so people can't be assumed to "browse the thread". If you wish to ask another question, you'd better use the "Ask Question" field rather than the "Your Answer" one. – Matthieu M. Jun 01 '10 at 06:26
  • I was indeed re-answering the question, but I was wrong to ask in the answer field; I should have added a new comment under my own new answer. Sorry about that. – uray Jun 01 '10 at 11:51
  • I'm done benchmarking the above code against std::queue with a CRITICAL_SECTION lock on Windows; the lock-free queue is actually 2~3 times _slower_ than the std::queue implementation with a lock. Do you know why? Is it because of the linked list? – uray Jun 01 '10 at 11:53
  • Sharing the benchmark code, including what system you're running, would be useful here. Also, what is the intended usage in your app? That's what matters. – Rick Jun 01 '10 at 20:40
  • 1
    Ouch. The cache line alignment hack is ugly. What happens when your code runs on a CPU with a different cache arrangement? Also, that struct consumes a LOT of cache, and L1 cache is a scarce resource. I can understand it being done, since otherwise you physically bond entities which are logically separate, but still: ouch. Expensive. – Jun 12 '12 at 08:56
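Following greyfade's suggestion of 64 bytes: there is no fully portable compile-time answer (C++17 later added `std::hardware_destructive_interference_size` for exactly this), but a common approach is to pick 64 and let `alignas` do the padding instead of hand-computed `char` arrays. A hedged sketch:

```cpp
#include <cstddef>

// 64 bytes is the line size on most current x86 CPUs; adjust per target.
constexpr std::size_t CACHE_LINE_SIZE = 64;

// alignas() makes the compiler insert the padding, replacing the
// manual `char pad[...]` members used in the queue above.
struct alignas(CACHE_LINE_SIZE) PaddedFlag {
    bool flag;   // the rest of the cache line is padding
};

static_assert(sizeof(PaddedFlag) == CACHE_LINE_SIZE,
              "one flag per cache line");
```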
1

Here is my implementation of a lock-free FIFO.

Make sure each item of type T is a multiple of 64 bytes (the cache line size on Intel CPUs) to avoid false sharing.

This code compiles with gcc/mingw and should compile with clang. It's optimized for 64-bit, so getting it to run on 32-bit would need some refactoring.

https://github.com/vovoid/vsxu/blob/master/engine/include/vsx_fifo.h

vsx_fifo<my_struct, 512> my_fifo;

Sender:

my_struct my_struct_inst;
... fill it out ...
while (!my_fifo.produce(my_struct_inst)) {}

Receiver:

my_struct my_struct_recv;
while(my_fifo.consume(my_struct_recv)) 
{ 
  ...do stuff...
}
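One way to satisfy the "multiple of 64 bytes" requirement above is to let the compiler round the item size up with `alignas` (the field names here are made up for illustration; use your own payload):

```cpp
#include <cstddef>

// alignas(64) rounds sizeof(my_struct) up to a multiple of 64,
// so adjacent items in the FIFO never share a cache line.
struct alignas(64) my_struct {
    int    id;
    double payload;
};

static_assert(sizeof(my_struct) % 64 == 0,
              "items must not straddle cache lines");
```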
jaw
1

How about this lfqueue?

This is a cross-platform, thread-safe queue with unbounded enqueue. It has been tested with multiple dequeuers, mixed enqueuers/dequeuers, and multiple enqueuers, and it guarantees memory safety.

For example

int* int_data;
lfqueue_t my_queue;

if (lfqueue_init(&my_queue) == -1)
    return -1;

/** Wrap This scope in other threads **/
int_data = (int*) malloc(sizeof(int));
assert(int_data != NULL);
*int_data = i++; /* i: a counter from the surrounding loop */
/*Enqueue*/
while (lfqueue_enq(&my_queue, int_data) == -1) {
    printf("ENQ Full ?\n");
}

/** Wrap This scope in other threads **/
/*Dequeue*/
while ((int_data = lfqueue_deq(&my_queue)) == NULL) {
    printf("DEQ EMPTY ..\n");
}

// printf("%d\n", *(int*) int_data );
free(int_data);
/** End **/

lfqueue_destroy(&my_queue);
Oktaheta
0

If you're looking for a good lock-free queue implementation, both Microsoft Visual Studio 2010 and Intel's Threading Building Blocks contain a good LF queue which is similar to the one in the paper.

Here's a link to the one in VC 2010

Rick
  • I tried the VS2010 one and benchmarked it; it is faster than "std::queue with lock" on small data sets, but exponentially slower on large data sets. – uray Jun 01 '10 at 17:05
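For reference, a minimal sketch of the kind of locked baseline these benchmarks compare against: a std::queue guarded by a mutex (a portable stand-in for the CRITICAL_SECTION version uray mentions; `LockedQueue` and `run_spsc` are my own illustrative names, and timing is left out since numbers vary by system):

```cpp
#include <mutex>
#include <queue>
#include <thread>

// Baseline: std::queue protected by a single mutex.
template <typename T>
class LockedQueue {
    std::queue<T> q;
    std::mutex m;
public:
    void push(const T& v) {
        std::lock_guard<std::mutex> g(m);
        q.push(v);
    }
    bool pop(T& out) {
        std::lock_guard<std::mutex> g(m);
        if (q.empty()) return false;
        out = q.front();
        q.pop();
        return true;
    }
};

// One producer, one consumer; returns the sum of all consumed values
// (a cheap correctness check that nothing was lost or duplicated).
inline long long run_spsc(int n) {
    LockedQueue<int> q;
    long long sum = 0;
    std::thread prod([&] { for (int i = 1; i <= n; ++i) q.push(i); });
    std::thread cons([&] {
        int got = 0, v;
        while (got < n)
            if (q.pop(v)) { sum += v; ++got; }
    });
    prod.join();
    cons.join();
    return sum;
}
```

Wrapping calls to `run_spsc` in a timer for both this queue and a lock-free one, across several queue depths, would reproduce the small-vs-large data set comparison discussed in the comments.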