
Below is a simplified C program that demonstrates a problem I am having with a concurrent stack implemented using the GNU built-in compare and swap on an Intel CPU. It took me a while to understand what was happening, but now that I do, I see that it is well within the guarantees provided by atomic compare and swap.

When a node is popped from the stack, modified, then placed back on the stack, the modified value may become the new head of the stack, corrupting it. The comments in test_get describe the order of events that cause this.

Is there any way to reliably use the same node with the same stack more than once? This is an exaggerated example, but even returning an unmodified node to gHead could leak other nodes. The original point of this data structure was to repeatedly use the same nodes.

#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct test_node {
    struct test_node *next;
    void *data;
} *test_node_p;

test_node_p gHead = NULL;
unsigned gThreadsDone = 0;

void test_put( test_node_p inMemory ) {
    test_node_p scratch;

    do {
        scratch = gHead;
        inMemory->next = scratch;
    } while ( !__sync_bool_compare_and_swap( &gHead , scratch , inMemory ) );
}

test_node_p test_get( void ) {
    test_node_p result;
    test_node_p scratch;

    do {
        result = gHead;
        if ( NULL == result ) break;
        //  other thread acquires result, modifies next
        scratch = result->next;     //  scratch is 0xDEFACED...
        //  other thread returns result to gHead
    } while ( !__sync_bool_compare_and_swap( &gHead , result , scratch ) );
    //  this thread corrupts gHead with 0xDEFACED... value

    if ( NULL == result ) {
        result = (test_node_p)malloc( sizeof(struct test_node) );
    }

    return result;
}

void *memory_entry( void *in ) {
    test_node_p node;
    int index , count = 1000;
    for ( index = 0 ; index < count ; ++index ) {
        node = test_get();
        *(uint64_t *)(node) |= 0xDEFACED000000000ULL;   /* deliberately scribble on node->next */
        test_put( node );
    }

    __sync_add_and_fetch(&gThreadsDone,1);

    return NULL;
}

int main( void ) {
    unsigned    index , count = 8;
    pthread_t   thread;

    gThreadsDone = 0;

    for ( index = 0 ; index < count ; ++index ) {
        pthread_create( &thread , NULL , memory_entry , NULL );
        pthread_detach( thread );
    }

    while ( __sync_add_and_fetch(&gThreadsDone,0) < count ) {}

    return 0;
}

I am running this test with 8 logical cores. When I use only 4 threads the problem rarely occurs, but with 8 it is easily reproducible. I have a MacBook with an Intel Core i7.

I am not interested in using some library or framework that has solved this; I want to know how it was solved. Thank you.

Edit:

Here are two solutions that do not work in my case.

Some architectures provide ll/sc (load-linked/store-conditional) instruction pairs that perform the atomic test on the address instead of the value. Any write to the address, even of the same value, prevents the store from succeeding. A sketch of how this would change test_get follows.
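
A minimal sketch of that idea. The load_linked and store_conditional primitives are hypothetical, standing in for something like ARM's LDREX/STREX; they do not exist as C functions:

/* hypothetical primitives; store_conditional returns nonzero on success */
test_node_p load_linked( test_node_p *inAddress );
int store_conditional( test_node_p *inAddress , test_node_p inValue );

test_node_p test_get_llsc( void ) {
    test_node_p result , scratch;

    do {
        result = load_linked( &gHead );
        if ( NULL == result ) break;
        scratch = result->next;
        /* the store fails if anything wrote gHead after load_linked,
           even if it wrote the same value back, so a stale scratch
           can never be installed */
    } while ( !store_conditional( &gHead , scratch ) );

    if ( NULL == result ) {
        result = (test_node_p)malloc( sizeof(struct test_node) );
    }

    return result;
}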

Some architectures provide a larger than pointer size compare and swap. With this, a monotonic counter is paired with the pointer and atomically incremented every time the pair is written, so the value always changes. Some Intel chips support this (cmpxchg16b), but there is no GNU wrapper. A sketch of this approach follows as well.
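
For reference, roughly what the counter pairing would look like if a 16-byte CAS were available. This is only a sketch: versioned_head and test_get_versioned are made-up names, and expressing the CAS as a __sync builtin on a 16-byte type assumes compiler support (gcc with -mcx16 on x86-64, which is exactly the wrapper described above as missing):

#include <stdint.h>

typedef unsigned __int128 dwcas_t;   /* one 16-byte value for cmpxchg16b */

typedef union {
    struct { test_node_p ptr; uint64_t ver; } v;
    dwcas_t raw;
} __attribute__(( aligned( 16 ) )) versioned_head;

versioned_head gVHead;

test_node_p test_get_versioned( void ) {
    versioned_head oldHead , newHead;

    do {
        oldHead.raw = gVHead.raw;   /* a torn read only makes the CAS fail */
        if ( NULL == oldHead.v.ptr ) return NULL;
        newHead.v.ptr = oldHead.v.ptr->next;
        newHead.v.ver = oldHead.v.ver + 1;   /* changes even when ptr repeats */
    } while ( !__sync_bool_compare_and_swap( &gVHead.raw , oldHead.raw , newHead.raw ) );

    return oldHead.v.ptr;
}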

Here is a play-by-play of the problem. Two threads, A and B, reach the point in test_get where they have the same value for result, and it is not NULL. Then the following sequence occurs:

  1. Thread A passes the compare and swap and returns result from test_get
  2. Thread A modifies the content of result
  3. Thread B dereferences result, getting whatever thread A put there
  4. Thread A finishes with result and calls test_put
  5. Thread A passes the compare and swap in test_put putting result back on gHead
  6. Thread B reaches the compare and swap in test_get and passes
  7. Thread B has now corrupted gHead with the value from Thread A

Here is a similar scenario with three threads that does not require Thread A to modify result.

  1. Thread A passes the compare and swap and returns result from test_get
  2. Thread A does not modify the content of result
  3. Thread B dereferences result, getting a valid value in scratch
  4. Thread C calls test_put with an unrelated value and succeeds
  5. Thread A calls test_put and succeeds in putting result back on gHead
  6. Thread B reaches the compare and swap in test_get and passes
  7. Thread B has now corrupted gHead by leaking whatever thread C added

In either scenario, the problem is that thread A passes the compare and swap twice, once for get then again for put, before thread B reaches the compare and swap for get. Any value thread B has for scratch should be discarded, but is not because the value in gHead appears correct.

Any solution that makes this less likely without actually preventing it just makes the bug harder to track down.

Note that the scratch variable is just a name for wherever the dereferenced value of result is placed before the atomic instruction begins. Removing the name does not remove the window between the dereference and the compare in which the thread may be interrupted.

Note also that atomic means it succeeds or fails as a whole. Any aligned read of a pointer is implicitly atomic on the hardware in question: as far as other threads are concerned, there is no interruptible point where only half the pointer has been read.

drawnonward

3 Answers


Never access an atomic variable through a simple evaluation. Also, for a compare and swap loop like yours, __sync_val_compare_and_swap is more convenient, I think.

test_node_p result, oldval;

/* read the head atomically */
result = __sync_val_compare_and_swap(&gHead, 0, 0);
/* compare and swap until we succeed placing the next field */
while ((oldval = result)) {
    result = __sync_val_compare_and_swap(&gHead, result, result->next);
    if (oldval == result) break;
}
Jens Gustedt
  • depending on the architecture you could have read two halves of the pointer from different instances, so the pointer could be completely invalid. – Jens Gustedt Jun 22 '13 at 07:28
  • That doesn't seem like a good architecture that can't load a whole pointer in a single read. :\ – xaxxon Jun 22 '13 at 07:29
  • @jxh while I suppose this is important, you nailed the logical problem with the original code. I'll go vote some of your other answers :) – xaxxon Jun 22 '13 at 07:34
  • @drawnonward, I think you are right in your comment to the now deleted answer. I'd leave my answer here for reference for a while, but please transform your comment there into an answer. – Jens Gustedt Jun 22 '13 at 07:36
  • the problem is not only the architecture but also your compiler. Unless that read is an explicit CAS, it can happen (or not) at just about any time. – starmole Jun 22 '13 at 07:39
  • @xaxxon: Thanks, I've returned the favor. – jxh Jun 22 '13 at 08:15

(I am discarding my previous answer.)

The issue is that you do not have a mechanism to atomically read gHead and gHead->next, but such a mechanism is needed to achieve your lock-free stack. Since you intend to busy loop anyway to deal with compare and swap collisions, you can use the equivalent of a spin lock:

#include <assert.h>

int gGetLock = 0;   /* 0 = unlocked, 1 = locked */

void lock_get () {
    while (!__sync_bool_compare_and_swap(&gGetLock, 0, 1)) {}
}

void unlock_get () {
    int unlock_get_success = __sync_bool_compare_and_swap(&gGetLock, 1, 0);
    assert(unlock_get_success);
}

Now, the loop in test_get() can be surrounded by lock_get() and unlock_get(). The CAS loop in test_get() then has just one thread contending with test_put(). Jens' implementation of the CAS loop seems cleaner, though.

lock_get();
result = __sync_val_compare_and_swap(&gHead, 0, 0);
while ((oldval = result)) {
    result = __sync_val_compare_and_swap(&gHead, result, result->next);
    if (oldval == result) break;
}
unlock_get();

This implements the intention, which is that only one thread should be popping off the head.

jxh
  • which function does that loop go in? – xaxxon Jun 22 '13 at 07:13
  • 1
    Your are supposing that the read of the variable that is done in the `while` condition is atomic. You can't. Your dereffence to obtain the `next` field then may be invalid. – Jens Gustedt Jun 22 '13 at 07:16
  • 1
    @JensGustedt I don't think that's true. If you aren't the first one to hit the sync_bool... then you won't do the result->next dereference, because the previous one has moved the head pointer to the next entry - making &ghead and result not equal. That bypasses the result->next dereference. – xaxxon Jun 22 '13 at 07:17
  • +1 for each of you, that's the case the asker is describing in the comments. If the second thread pulls the head and then puts it back, the read is invalidated. This is a real problem if there was a 3rd thread that did a put in the meantime. – jxh Jun 22 '13 at 07:19
  • 1
    Your code looks a little different but would compile about the same. Two threads get the same `result` off `gHead` in `test_get`. The first thread passes the compare and swap barrier, and modifies `result->next`. The second thread reads `result->next` into `scratch`. The first thread puts `result` back on `gHead`, passing the compare and swap barrier in `test_put`. The second thread finally reaches the compare and swap barrier in `test_get`. It has the same `result` it started with that has now been returned to `gHead`. It passes the compare and swap barrier, corrupting `gHead`. – drawnonward Jun 22 '13 at 07:30
  • 1
    @drawnonward: Yes, I had realized that after I thought about Jens' and xaxxon's comments. I think you need a spin lock. – jxh Jun 22 '13 at 08:52
  • The spin lock prevents two threads from using the same value from gHead at the same time. – drawnonward Jun 22 '13 at 17:30

If you have a CAS variable (in your case gHead), you have to always use CAS to access it, or protect it with a lock. That goes for reading as well as writing. Stuff like `result = gHead;` is a big no-no.
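
For example, the same trick as in Jens' answer forces the read through the CAS itself:

/* atomic read of gHead: comparing against NULL and swapping in NULL never
   changes the stored value, but always returns the current head */
test_node_p snapshot = __sync_val_compare_and_swap( &gHead , NULL , NULL );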

Re-reading your question: a LIFO is a stack. Implementing any CAS-based data structure is based on having only ONE thing to change. In a stack, that is the top of stack. You seem to be doing a linked list. I am sure there are cool ways to do an atomic linked list.

But for a stack, use one stack pointer, like everybody else :)

starmole