Thread synchronization issue: possible race, misuse of volatile, cache coherency?

Question

I am getting acquainted with pthreads programming; the following code is a simple producer/consumer design where data is put/fetched from a global list. The problem is: the data pointer in the consumer function is tried to be freed twice. Beside, if I add a printf() at the beginning of the loop, everything seems ok... What am I doing wrong? I suspect a misuse of the volatile keyword, or something hidden by the cache... Unless it's just a design issue (which probably is :p).

Thanks for your insights.

Note: malloc()/free() is thread-safe on my system. I am compiling with $ gcc -pthread -O0 which should, I guess, reduce possible design errors due to misuse of volatile. Finally, we don't care in this code snippet with running out of memory (in case of more producer than consumer).

Edit: Changed code to a single list head.

#include <pthread.h>
#include <stdlib.h>
#include <string.h>

pthread_mutex_t lock;
pthread_cond_t new_data;

struct data {
  int i;
  struct data *next;
};

struct data *list_head = NULL;

void *consume(void *arg)
{
  struct data *data;

  while (1) {
    pthread_mutex_lock(&lock);
    while (list_head == NULL) {
      pthread_cond_wait(&new_data, &lock);
    }
    data = list_head;
    list_head = list_head->next;
    pthread_mutex_unlock(&lock);
    free(data);
  }

  return NULL;
}

void *produce(void *arg)
{
  struct data *data;

  while (1) {
    data = malloc(sizeof(struct data));
    pthread_mutex_lock(&lock);
    data->next = list_head;
    list_head = data;
    pthread_mutex_unlock(&lock);
    pthread_cond_signal(&new_data);
  }

  return NULL;
}

int main()
{
  pthread_t tid[2];
  int i;

  pthread_mutex_init(&lock, NULL);
  pthread_cond_init(&new_data, NULL);
  pthread_create(&tid[0], NULL, consume, NULL);
  pthread_create(&tid[1], NULL, produce, NULL);
  for (i = 0; i < 2; i++) {
    pthread_join(tid[i], NULL);
  }
}

And the output:

$ ./a.out 
*** glibc detected *** ./a.out: double free or corruption (fasttop): 0x00007f5870109000 ***

You should remove all use of `volatile` in this code. The lock generates the correct memory barriers and such for your platform -- `volatile` is for communication with memory mapped devices and such, not for synchronization. `volatile` prevents reordering or read/write removal by the compiler, but not by the CPU. — Billy ONeal, Jan 28 '13 at 20:25
As an aside, I'd also swap the `pthread_cond_signal` and `pthread_mutex_unlock`. — user7116, Jan 28 '13 at 20:55
@sixlettervariables I have seen both permutations on different code samples, I can't figure out what's the best practice... I guess that's because the lock is still hold and even though the waiting thread is released from waiting, it still won't be able to get the lock since it has not been released yet? — Benny, Jan 28 '13 at 21:07
I believe you should unlock then signal. Regardless, your example is *LIFO* but I'll bet you want *FIFO* semantics, am I correct? — user7116, Jan 28 '13 at 21:12
@BillyONeal The `volatile` was a qualifier to the pointer type, not to the storage. In `consume()`, `data = list_head;`, are we sure that the compiler will issue code that fetch `list_head` from memory, since `list_head` pointer will be modified outside this this loop? I mean instead of keeping it into a register? — Benny, Jan 28 '13 at 21:15
@sixlettervariables I don't care at this point, I have changed the code to LIFO for the sake of simplicity. — Benny, Jan 28 '13 at 21:20
@Benoit: No, the lock itself is supposed to force the compiler to emit a load from memory for that. — Billy ONeal, Jan 28 '13 at 21:33
@BillyONeal Thanks for the quick answer, do you have any references about that? — Benny, Jan 28 '13 at 21:38
@Benoit: http://stackoverflow.com/questions/3208060/does-guarding-a-variable-with-a-pthread-mutex-guarantee-its-also-not-cached <-- Basically, PThreads emits the memory barrier the CPU needs. A compiler can't cache anything stored in a global variable across call to another function, because that other function may modify that global and the compiler does not know that. (if a compiler implements inlining here, then caching the global would have observable side effects and is thus forbidden) — Billy ONeal, Jan 28 '13 at 23:20
@Benoit: Signalling the condition variable either before or after the unlock makes no difference in terms of correctness. Signalling it with the lock held is often considered "better" because it ensures that if a higher-priority thread is waiting on the condition variable then a lower-priority thread won't be able to jump in ahead of it (in the window between the unlock and the signal). If you're not using thread priorities, then this won't be a concern. — caf, Jan 30 '13 at 12:30

user7116 · Accepted Answer · 2013-01-28T22:57:27.657

Consider the following scenario:

produce -> acquired lock
consume -> wait lock
produce -> allocate d0, write_ptr = d0, read_ptr = d0, signal, unlock
consume -> acquired lock
produce -> wait lock
consume -> satisfied condition
consume -> data = d0, read_ptr = NULL, unlock
consume -> free d0
produce -> acquired lock, allocate d1
consume -> wait lock
produce -> write_ptr != null so write_ptr->next = d1
produce -> read_ptr == null so read_ptr = d1

Check out step 11. write_ptr is still d0 even though it was free'd independently of the producer thread. You need to make sure consume does not invalidate write_ptr.

A doubly linked list would allow you to avoid some of these difficulties (as the readers and writers work from separate ends).

Main:

Create sentinel nodes HEAD and TAIL, link HEAD and TAIL
Spawn producer
Spawn consumer

Producer:

Lock
Create node
Link HEAD->next->prev to node
Link node->next to HEAD->next
Link HEAD->next to node
Link node->prev to HEAD
Unlock
Signal

Consumer:

Lock
Wait for TAIL->prev != HEAD (do { pthread_cond_wait } while (condition);)
data = TAIL->prev
Link TAIL->prev to data->prev
Link TAIL->prev->next to TAIL
Unlock
Use data

You are right. That's the problem. He's never setting write_ptr back to NULL again. But wouldn't this behavior be hidden by the bug doron and me pointed out? — junix, Jan 28 '13 at 20:49
Either way he'll run into using memory that is gone. Which at some point corrupts. — user7116, Jan 28 '13 at 20:50
Well that's true. Funny though this results in such "defined" undefined behavior :-) — junix, Jan 28 '13 at 20:51

score 4 · Answer 2 · answered Jan 28 '13 at 20:30

4

I believe that your problem is in the following lines:

if (NULL != write_ptr)
  write_ptr->next = data;
write_ptr = data;
if (NULL == read_ptr)
  read_ptr = data;

I don't think you are building your list correctly. In fact, I don't understand why you have two lists. But anyway...

I assume that you want to add your new data to the head of the list. Otherwise you would need a tail pointer or you would need to chase to the end of the list each time.

To do this you need to add the current list head as the next item of your data. It should look like:

data->next = write_ptr;
write_ptr = data;

There is no need for a NULL check on write_ptr.

answered Jan 28 '13 at 20:30

Zan Lynx

53,022
10
79
131

That wasn't two lists, but two pointers running on the same list. Anyway you're right, my LL implementation was flawed. I've changed it, although it's now LIFO instead of FIFO. – Benny Jan 28 '13 at 21:34
@Benoit: Oh, I see now. The write pointer is supposed to be the end of the list while the read pointer is the head of the list. In which case my suggestion is wrong. – Zan Lynx Jan 28 '13 at 23:05

score 1 · Answer 3 · edited May 23 '17 at 12:04

Update: As Billy ONeal pointed out, pthread functions provide the needed memory barriers, so it's not necessary to declare anything volatile as long as it's protected by pthread locks. (See this question for details: Does guarding a variable with a pthread mutex guarantee it's also not cached?)

But I got another explaination for some odd behavior: Your linked list the producer creates is broken: Assume write_ptr is not NULL and observe the behavior:

/* 1 */ if (NULL != write_ptr)
/* 2 */   write_ptr->next = data;
/* 3 */ write_ptr = data;

Say write_ptr points to previously allocated instance A. A.next is NULL. We newly allocate an instance B and set everything to NULL (therefore: B.next is NULL).

Line 1 evaluates true, therefore line 2 is executed. A.next is now pointing to B. After executing line 3, write_ptr points to B B.next is NULL => A is lost what results into a memory leak.

But for now I can't see why the libc complains about double free.

But it is protected by a lock. I think all uses of `volatile` in this code should be removed. The lock should generate the requisite memory barriers required. — Billy ONeal, Jan 28 '13 at 20:17
Correct. Volatile adds nothing when locks are used properly. — Zan Lynx, Jan 28 '13 at 20:27
@BillyONeal You are perfectly wright. I was mistaken. See my updated post. — junix, Jan 28 '13 at 20:40

score 1 · Answer 4 · answered Jan 28 '13 at 20:44

The error is in the lines

if (NULL != write_ptr)
   write_ptr->next = data;
write_ptr = data;

This should read: if (NULL != write_ptr) data->next = write_ptr; write_ptr = data;

For debugging purposes also make sure that you are getting your expected values out the queue.

Also there is no need for the volatile variables. since your queue is protected by mutexes, the operating system will insure that queue access is atomic. volatile is only needed when accessing memory mapped hardware resources and should never be used for synchronization. All it does is forces data to memory unnecessarily.

There is also one other issue. If your producer is faster than your consumer, you will eventually run out of memory unless you limit the size of the queue and force the producer to wait for the consumer. Smaller queues are more responsive and the only reason to go for larger queues is to smooth out perturbations in the producer.

That change the FIFO design to LIFO, but anyway that's fixing my issue. I changed my LL implementation. Thanks. — Benny, Jan 28 '13 at 21:30

Thread synchronization issue: possible race, misuse of volatile, cache coherency?

4 Answers4