
This question is a follow-up to this question. To summarize: I had a server calling close() to finish off connections, but the shutdown sequence never seemed to occur; the client kept waiting for more data even though close() returned 0 on the server. Switching my thread-safe queue from a condition-variable wait to semaphores solved the issue, even though the condition-variable queue is, as far as I can tell, correctly implemented. I'm posting my code to see if anyone can shed any light on this for me.

condition-based queue:

  TASK *head;
  pthread_mutex_t mutex;
  pthread_cond_t cond;

  void init( ) {
    head = NULL;
    pthread_mutex_init(&mutex, NULL);
    pthread_cond_init(&cond, NULL);
  }

  TASK *get( ) {
    pthread_mutex_lock( &mutex );
    while(!head)                          /* sleep until a task is queued */
      pthread_cond_wait(&cond, &mutex);
    TASK *res = head;                     /* pop the front node */
    head = head->next;
    pthread_mutex_unlock( &mutex );
    return res;
  }

  void add( TASK *t ) {
    pthread_mutex_lock( &mutex );
    t->next = head;                       /* push onto the front */
    head = t;
    pthread_cond_broadcast( &cond );      /* wake all waiting consumers */
    pthread_mutex_unlock( &mutex );
  }

I do realize this one is a LIFO queue and the next one is a FIFO; I've only included the interesting bits so it's quick and easy to read. A rough sketch of how the server drives the queue follows below.
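For context, this is roughly how the producer and consumer sides use the queue. The TASK layout and the worker_thread, start_workers, and enqueue_connection names below are hypothetical stand-ins for illustration; the real code isn't included here.

  #include <pthread.h>
  #include <stdlib.h>
  #include <unistd.h>

  /* Hypothetical TASK layout -- the real one isn't shown above. */
  typedef struct TASK {
    int          fd;    /* client socket to service               */
    struct TASK *next;  /* link used by the condition-based queue */
  } TASK;

  /* The queue operations shown above. */
  void  init( );
  TASK *get( );
  void  add( TASK *t );

  /* Hypothetical consumer: each worker blocks in get() until a task arrives. */
  static void *worker_thread( void *arg ) {
    (void)arg;
    for(;;) {
      TASK *t = get();    /* sleeps on the condition variable */
      /* ... read the request, write the response ... */
      close(t->fd);       /* the close() whose shutdown never seemed to happen */
      free(t);
    }
    return NULL;
  }

  /* Hypothetical startup: spin up a fixed pool of consumers. */
  static void start_workers( int n ) {
    for(int i = 0; i < n; i++) {
      pthread_t tid;
      pthread_create(&tid, NULL, worker_thread, NULL);
      pthread_detach(tid);
    }
  }

  /* Hypothetical producer: the accept loop queues one task per connection. */
  static void enqueue_connection( int client_fd ) {
    TASK *t = malloc(sizeof(TASK));
    if(!t) { close(client_fd); return; }
    t->fd = client_fd;
    add(t);               /* wakes a sleeping worker */
  }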

semaphore-based queue:

  TASK *buf;
  TASK *next_reader;
  TASK *next_writer;
  TASK *endp;
  pthread_mutex_t writer_mutex;
  pthread_mutex_t reader_mutex;
  sem_t writer_sem;
  sem_t reader_sem;

  void init( int num_tasks ) {
    buf = calloc(num_tasks, sizeof(TASK));
    next_reader = buf;
    next_writer = buf;
    endp = buf + num_tasks;
    sem_init(&writer_sem, 0, num_tasks);
    sem_init(&reader_sem, 0, 0);
    pthread_mutex_init(&writer_mutex, NULL);
    pthread_mutex_init(&reader_mutex, NULL);
  }

  TASK *get( ) {
    TASK *res = NULL;
    sem_wait(&reader_sem);              /* wait for a filled slot */
    pthread_mutex_lock(&reader_mutex);
    res = next_reader;
    next_reader++;
    if(next_reader == endp)             /* wrap around the ring buffer */
      next_reader = buf;
    pthread_mutex_unlock(&reader_mutex);
    sem_post(&writer_sem);              /* release a slot back to writers */
    return res;
  }

  void add( TASK *t ) {
    sem_wait(&writer_sem);              /* wait for a free slot */
    pthread_mutex_lock(&writer_mutex);
    *next_writer = *t;                  /* copy the task into the slot */
    next_writer++;
    if(next_writer == endp)             /* wrap around the ring buffer */
      next_writer = buf;
    pthread_mutex_unlock(&writer_mutex);
    sem_post(&reader_sem);              /* signal a waiting reader */
  }

I can't for the life of me see how the change from the condition queue to the semaphore queue would resolve the problem in my previous question, unless something funky happens when one thread closes a socket while pthread_cond_broadcast is called during the close. I'm assuming an OS bug, because I can't find any documentation condemning what I'm doing. None of the queue operations are called from signal handlers. Here's my distro:

Linux version: 2.6.21.7-2.fc8xen

CentOS version: 5.4 (Final)

Thanks

EDIT ---- I've just added the initializations I'm doing. In the actual code, these are implemented in a templated class; I've only included the relevant portions.
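As Art suggests in the comments, I may also start checking the return values from the pthread calls. A minimal sketch of what that could look like for the condition-based get() above (the CHECK macro is hypothetical and isn't in the code yet):

  #include <pthread.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>

  /* Hypothetical wrapper: pthread calls return an error number rather
     than setting errno, so fail loudly if any of them report a problem. */
  #define CHECK(call)                                        \
    do {                                                     \
      int rc_ = (call);                                      \
      if(rc_ != 0) {                                         \
        fprintf(stderr, "%s failed: %s\n", #call,            \
                strerror(rc_));                              \
        abort();                                             \
      }                                                      \
    } while(0)

  /* get() from the condition-based queue with every call checked. */
  TASK *get( ) {
    CHECK(pthread_mutex_lock( &mutex ));
    while(!head)
      CHECK(pthread_cond_wait(&cond, &mutex));
    TASK *res = head;
    head = head->next;
    CHECK(pthread_mutex_unlock( &mutex ));
    return res;
  }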

  • Are the mutexes and condition variables properly initialized? They don't seem to be in your example. – Art Oct 29 '12 at 13:49
  • everything is initialized. I'll add that in. – DavidMFrey Oct 29 '12 at 14:02
  • Just checking. I've had pthread_mutex_lock fail before because the mutex was uninitialized and of course I didn't check the return values from it. – Art Oct 29 '12 at 14:13
  • @Art no, that was a good call, I probably would've suspected the same :) – DavidMFrey Oct 29 '12 at 14:14
  • @nos The servers using this code have been run through valgrind many times. There's no mem corruption showing up. – DavidMFrey Oct 29 '12 at 14:19
  • Actually. How about checking the return values from pthread_mutex* and see if that shows anything interesting? You never know. Also, how do you tear down the queue? In the semaphore version you can potentially leave the get function even though the queue is empty. In the condition variable function you can't get out of get (heh) unless there's something in the queue. IIRC tearing down mutexes and condition variables that something sleeps on is implementation defined. – Art Oct 29 '12 at 14:21
  • @Art The queue is for the life of the program. It's a server that runs continuously until killed by a SIGTERM. Then the consumer threads are simply cancelled and the queues are destroyed. I'm afraid our shutdown is a bit messy, but we're too busy building out new stuff to spend time on it... I can check the return values of the pthread calls. I'll post later if it shows anything of interest. – DavidMFrey Oct 29 '12 at 14:25
  • Ok. I read your other post and what you're seeing is really weird, especially since it's a once-in-50k problem. I'd tell you what I tell our customers when they ask me to dig into weird threading problems that smell like something in the kernel: update your CentOS version. What you're using turned 3 years old a week ago. Linux is not bug-free, especially not the things they do in CentOS. – Art Oct 29 '12 at 14:41
  • @Art Thanks for the suggestion. I'll try to get the ops guys to upgrade sometime. I mainly was interested if anyone had any known issues like this, and to at least get the issue indexed by Google in case someone else runs up against it. Almost drove me crazy for a week! I'll repost if/when we get the servers upgraded. – DavidMFrey Oct 29 '12 at 14:49

0 Answers