
I am wondering why it would be incorrect to implement this sort of queue the naive way:

#include <stdlib.h>
#include <stdio.h>
#include <pthread.h>
#include <sys/time.h>
#include <unistd.h>


void *print_message_function( void *ptr );

void *reader( void *ptr );
void *writer( void *ptr );



int queue[500000];


int main(int argc, char **argv) 
{
   pthread_t thread1, thread2;

   char *message1 = "Thread 1";
   char *message2 = "Thread 2";
   int  iret1, iret2; 


   iret1 = pthread_create( &thread1, NULL, writer, (void*) message1);
   iret2 = pthread_create( &thread2, NULL, reader, (void*) message2);

   usleep(2000);


   exit(0);

}



void *writer( void *ptr )
{
  // make local copy of queue head
  register int *pos = queue; 

  //   struct thread_param *tp = arg;
  int counter = 0;

  while(1)
  {
    //Write to head of queue
    *pos = 5;

    pos++;

    print_message_function(  ptr);
  }
}


void *reader( void *ptr )
{
  int counter = 0;

  // make local copy of queue head
  register int *pos = queue; 

  while(1)
  {

    // Read from tail of queue - loop when nothing
    if ( *pos == 5 ) 
    { 
      print_message_function( ptr ); 
      pos++; 
    }
  }
}



void *print_message_function( void *ptr )
{
      char *message = (char *) ptr;
      printf("%s \n", message);
      return NULL;
}

I do intend to cache align the queue.

I do not believe memory reordering is a problem, since a copy of the queue head is made at start and there is a single reader and a single writer.

My reason for wanting this is that it should be faster than either mutex locks or CAS operations.

reuben
BAR
  • Is this intended to be portable C code or is it platform-specific? If so, what platform is it supposed to work on? – David Schwartz Jul 07 '12 at 18:04
  • It's platform-independent Linux; I can code custom assembly if need be. – BAR Jul 07 '12 at 18:07
  • That's kind of an oxymoron. Is it Linux-specific? Or is it platform-independent? – David Schwartz Jul 07 '12 at 18:08
  • Sorry... it's platform-specific. – BAR Jul 07 '12 at 18:11
  • the reader is doing busy-wait... – Karoly Horvath Jul 07 '12 at 18:13
  • @KarolyHorvath: Yep. Which wastes power, can starve other cores, and forces a horrible performance penalty at the most critical time -- when you exit the loop. – David Schwartz Jul 07 '12 at 18:17
  • I should mention I also intend to do cpu-affinity so busy wait is not a problem, I want "faster than anything else". – BAR Jul 07 '12 at 18:17
  • @user417896: Then you have your design completely backwards. You are, at minimum, basically guaranteed a branch misprediction when you exit the loop. Those are definitely not fast. (Why are you doing this? Please just do it the normal way. Even CPU experts have a very hard time getting this right -- and then a new CPU comes out.) – David Schwartz Jul 07 '12 at 18:18
  • are `int` writes atomic on your CPU? – Karoly Horvath Jul 07 '12 at 18:19
  • @David What is the fastest way experts do it? Surely not a mutex. – BAR Jul 07 '12 at 18:22
  • 2
    Get your system working to a functional level with coventional blocking queues. If you have a performance issue, you MAY find that you have to look at some different queue class. You're only going to be queueing pointers, yes? You have to try quite hard to get contention when the queue is only locked for push/pop one pointer time. Overall, your app performance will probably be blown by the CPU-loops, as others have warned. Likewise CPU-affinity bodges. Would your time not be better spent on 'real' app code? – Martin James Jul 07 '12 at 18:29
  • With illegal instructions, catching the fault, and patching for the particular CPU/platform. Generally, you wind up with an atomic operation and a loop with a `rep nop` in it. (But please take Martin James advice -- you shouldn't be messing with this stuff. Do you really want to be writing the special case code for SMP Pentium Pro systems? For P4 hyper-threaded systems?) – David Schwartz Jul 07 '12 at 18:34
  • I would add that, If I found any app using up a complete core, or cores, while idle, I would immediately remove it from my systems and ask for my money back. – Martin James Jul 07 '12 at 18:41
  • By the way, those are real special cases. SMP PPro had an errata that required you to use a locked exchange instead of a plain write. And some P4's with hyper-threading will livelock if you read-spin on one virtual core while the other tries to write. If you want to write this kind of low-level code, you have to account for the details of the platform. (And generally, you just shouldn't do it.) – David Schwartz Jul 07 '12 at 19:10
  • @user417896 one way you can do it is to use a Futex/CS to protect the queue and, only if the queue is found empty within the F/CS, lock the consumer on a separate condvar/event for each consumer, (this allows a consumer to be made ready with an object already stored in consumer var so it doesn't have to revisit the queue). I rarely use such optimizations - I can get all I need for most apps with a bog-standard blocking P-C queue: a futex/CS and semaphore. With object pools, I don't even need bounded queues - that alone saves a pile of ring-cycles. No space-heaters/CPU-loops for me:) – Martin James Jul 07 '12 at 20:29

2 Answers


With POSIX threads you only have data coherence between threads if you use mutexes, locks, etc., and that coherence has no well-defined interface with your compiler (and `volatile` definitely isn't it). Don't do it like that; all sorts of things can happen, such as updates of variables being optimized out (here `volatile` could help), or partial reads or writes.

C11, the new C standard, has a threading model that includes a data coherence model, thread creation functions, and atomic operations. There is no compiler that implements this completely yet, it seems, but gcc or clang on top of POSIX threads implement the features that you need. If you'd like to try this out and be future-proof, P99 implements wrappers for these platforms that let you use the new C11 interfaces.

C11's `_Atomic` types and operations would be the correct tools to implement lock-free queues that operate between threads.

Jens Gustedt

In C, the volatile keyword has no defined semantics that apply when a variable is accessed concurrently in multiple threads (and pthreads doesn't add any). So the only way to know if it's safe or not is to look at the effects volatile has on particular platforms and compilers, figure out every possible way it could go wrong on those specific hardware platforms, and rule them out.

It's a really bad idea to do this if you have a choice. Portable code tends to be much more reliable. The two big problems are:

  1. New platforms do come out. And fragile code can break when a new CPU, compiler, or library is released.

  2. It's very hard to think of every way this could go wrong because you don't really know what you're working with. Mutexes, atomic operations, and the like have precisely-defined semantics for multiple threads, so you know exactly what guarantees you have -- on any platform, with any compiler, with any hardware.

Your reader code is terrible, by the way. On hyper-threaded CPUs, for example, tightly spinning like that will starve the other virtual core. Worse, you could wind up spinning at FSB speed, starving other physical cores. And when you exit the spin loop -- the moment performance is most critical -- you basically force a mispredicted branch! (The exact effects depend on the specifics of the CPU, which is another reason using this kind of code is bad. You need at least a `rep nop`.)

David Schwartz
  • The thing is, a variable is not accessed concurrently in multiple threads. Each has its own pointer into the queue. – BAR Jul 07 '12 at 18:13
  • You have the same problem. My point was that `volatile` doesn't ensure your code is safe. Without it, there's still nothing that makes it safe. – David Schwartz Jul 07 '12 at 18:20