3

What is the most efficient way to create a thread with an argument? The argument is a struct; if the struct cannot stay on the parent thread's stack, there are two solutions.

With dynamic memory allocation

struct Arg{
    int x;
    int y;
};

void* my_thread(void* v_arg){
    Arg* arg = (Arg*) v_arg;

    //... running

    delete arg;
    return NULL;
}

//Creating a thread
void a_function(){
    Arg* arg = new Arg;
    arg->x = 1; arg->y = 2;

    pthread_t t;
    pthread_create(&t, NULL, my_thread, arg);
    pthread_detach(t);
}

With semaphore

struct Arg{
    sem_t sem;
    int x;
    int y;
};

void* my_thread(void* v_arg){
    Arg* arg = (Arg*) v_arg;
    int arg_x = arg->x;
    int arg_y = arg->y;
    sem_post( &(arg->sem) );

    //... running

    return NULL;
}

//Creating a thread
void a_function(){
    Arg arg;
    arg.x = 1; arg.y = 2;
    sem_init( &(arg.sem), 0, 0);

    pthread_t t;
    pthread_create(&t, NULL, my_thread, &arg);
    pthread_detach(t);

    sem_wait( &(arg.sem) );
    sem_destroy( &(arg.sem) );
}

I work with Linux and Windows.

mbq
  • 18,510
  • 6
  • 49
  • 72
Squall
  • 4,344
  • 7
  • 37
  • 46
  • The snippet indicates you're using some C-style language, and I infer it's C from the fact that you're programming in Linux, but I can't tell that for sure. As it is highly relevant to the question, could you tag the question with the language used, and/or provide some more specifics? – KeithS Feb 17 '11 at 17:57
  • 1
    @KeithS: This definitely looks like C or C++ on Linux (he's using `pthreads`), but similar code can be used for Windows and from what I recall his question is valid on both platforms. – Argote Feb 17 '11 at 18:04
  • 1
    It is C++, but C may be used. There is pthreads for Windows. I used pthreads as an example. – Squall Feb 17 '11 at 18:17

3 Answers

2

In the code you have posted, the most efficient implementation is heap allocation (your first example). The reason is that a heap allocation (using `new` or `malloc`) is much cheaper than a context switch. Consider what needs to happen in your second example:

  1. Allocate stack space for Arg
  2. Initialize the semaphore
  3. Start the thread and switch context
  4. Copy the variables to the new stack
  5. Switch context back
  6. Destroy semaphore
  7. Detach thread
  8. Switch context

Alternatively, your first example:

  1. Allocate heap space for Arg
  2. Start the thread
  3. Detach thread
  4. Switch context
Seth
  • 2,683
  • 18
  • 18
  • Something's not quite right about this answer. If the machine is multicore/SMP, the first example will not involve any context switches, just some atomic ops and maybe a little bit of spinning in the "parent" thread (it shouldn't spin long enough to actually go to sleep). On the other hand, `malloc` involves acquiring and releasing locks on global state. Even if you have an arena-based/threadcache-based allocator, this pattern of allocating memory in one thread and 'donating' it to another thread will lead to worse contention behavior and possibly bad memory fragmentation. – R.. GitHub STOP HELPING ICE Feb 17 '11 at 23:06
  • @R - In principle, I think you have a point. However, for the code as posted, malloc (in the glibc implementation as well as others) involves only atomic operations on the fastbins as the structure is less than 64 bytes. There is no locking. – Seth Feb 18 '11 at 00:25
  • This answer is low performance compared to my solution below. – johnnycrash Apr 07 '11 at 20:57
0

It depends. If your struct isn't big, it's better to allocate it dynamically in order to avoid the extra synchronization calls. If the structure is quite big and you are writing code for a system with little memory, it's better to use a semaphore (or even a condvar).

Dan Kruchinin
  • 2,945
  • 1
  • 17
  • 21
0

Atomic-op solution. This is a very fast way to get memory for your arguments.

If the arguments are always the same size, pre-allocate a bunch of them. Add a `pNext` pointer to your struct to link them together. Create a global `_pRecycle` that holds all the available ones as a linked list, using `pNext` to link them. When you need an argument, use an atomic CAS to pop one off the head of the recycle list. When you are done, use another CAS to push the arg back onto the head of the list.

CAS refers to something like `__sync_bool_compare_and_swap(ptr, oldval, newval)`, which returns 1 (true) on success.

to grab argument memory:

while (1)  {  // concurrency loop
  pArg = _pRecycle;  // _pRecycle is the global ptr to the head of the available arguments
  // POINT A
  if (CAS(&_pRecycle, pArg, pArg->pNext))  // swing the head to the next item if it hasn't changed
    break; // success
  // POINT B
}
// you can now use pArg to pass arguments

to recycle argument memory when done:

while (1)  {  // concurrency loop
  pArg->pNext = _pRecycle;
  if (CAS(&_pRecycle, pArg->pNext, pArg))  // swing the head to pArg if it hasn't changed
    break; // success
}
// you have given the mem back

There is a race condition (the classic ABA problem) if another thread grabs and recycles pArg while this thread is stalled between POINT A and POINT B: the CAS can then succeed even though the list has changed underneath it. If your work takes a long time to process, this won't be a problem in practice. Otherwise you need to version the head of the list, which means you must atomically change two things at once. Unions combined with 64-bit CAS to the rescue!

typedef union _RecycleList {
  struct {
    int   iVersion;
    TArg *pHead;           // assumes 32-bit pointers; a 64-bit build needs a 128-bit union and 128-bit CAS
  };
  unsigned long long n64;  // this value is iVersion and pHead at the same time!
} TRecycleList;

TRecycleList _Recycle;

to get mem:

while (1)  // concurrency loop
{
  TRecycleList Old, New;
  Old.n64 = _Recycle.n64;
  New.n64 = Old.n64;
  New.iVersion++;
  pArg = New.pHead;
  New.pHead = New.pHead->pNext;
  if (CAS(&_Recycle.n64, Old.n64, New.n64)) // if the version hasn't changed we get the mem
    break; // success
}

to put mem back:

while (1)  // concurrency loop
{
  TRecycleList Old, New;
  Old.n64 = _Recycle.n64;
  New.n64 = Old.n64;
  New.iVersion++;
  pArg->pNext = New.pHead;
  New.pHead = pArg;
  if (CAS(&_Recycle.n64, Old.n64, New.n64))  // if the version hasn't changed we release the mem
    break; // success
}

Since 99.9999999% of the time no two threads will be executing the grab code at the same time, you get great performance. Our tests have shown CAS to be as little as 2x slower than just setting `_pRecycle = _pRecycle->pNext`. 64-bit and 128-bit CAS are just as fast as 32-bit. Basically it screams. Every once in a while the concurrency loop will execute twice when two threads actually race, but one always wins, so the race resolves very fast.

johnnycrash
  • 5,184
  • 5
  • 34
  • 58