
I have a project that I have summarized here with some pseudo code to illustrate the problem. I do not have a compiler issue; the real code compiles fine whether it uses boost or pthreads. Remember this is pseudo code designed to illustrate the problem, not directly compilable code.

The problem I am having is that for a multithreaded function, the memory usage and processing time are always greater than if the same work is achieved using serial programming, e.g. a for/while loop.

Here is a simplified version of the problem I am facing:

class aproject {

public:

typedef struct
{
    char** somedata;
    double output,fitness;
} entity;

entity **entity_array;

int whichthread,numthreads;
pthread_mutex_t mutexdata;

aproject(){
    numthreads = 100;
    entity_array = new entity*[numthreads];
    for(int i=0;i<numthreads;i++){
        entity_array[i] = new entity;
        entity_array[i]->somedata    = new char*[2];
        entity_array[i]->somedata[0] = new char[100];
        entity_array[i]->somedata[1] = new char[100];
    }

    /*.....more memory allocations for entity_array.......*/
    this->initdata();
    this->eval_thread();
}
void initdata(){
/**put zeros and ones in entity_array**/

}
float somefunc(char *somedata){
    float output = countzero(somedata);   //some other function, not listed
    return output;
}

void* thread_function()
{
    //lock only while claiming a unique index into entity_array
    pthread_mutex_lock(&mutexdata);
    int currentthread = this->whichthread;
    this->whichthread += 1;
    pthread_mutex_unlock(&mutexdata);

    entity *ent = this->entity_array[currentthread];

    double A = somefunc(ent->somedata[0]);
    double B = somefunc(ent->somedata[1]);
    double t4 = anotherfunc(A,B);

    ent->output  = t4;
    ent->fitness = sqrt(pow(t4,2));   //i.e. fabs(t4)

    return NULL;
}



static void* staticthreadproc(void* p){
    return static_cast<aproject*>(p)->thread_function();
}


void eval_thread(){
    //use multithreading to evaluate individuals in parallel
    int nthreads = this->numthreads;
    pthread_t *threads = new pthread_t[nthreads];

    //create threads
    pthread_mutex_init(&this->mutexdata,NULL);
    this->whichthread = 0;

    for(int i=0;i<nthreads;i++){
        pthread_create(&threads[i],NULL,&aproject::staticthreadproc,this);
        //printf("creating thread, %d\n",i);
    }

    //join threads
    for(int i=0;i<nthreads;i++){
        pthread_join(threads[i],NULL);
    }

    delete[] threads;
}

};

I am using pthreads here because it works better than boost on machines with less memory. Each thread is started in eval_thread and joined there as well. I am using a mutex to ensure every thread starts with the correct index for the entity_array: each thread only applies its work to its respective entity_array element, indexed by the variable this->whichthread. This variable is the only thing that needs to be locked by the mutex, as it is updated for every thread and must not be changed by other threads while one thread is reading it. You can happily ignore everything apart from thread_function, eval_thread, and staticthreadproc, as they are the only relevant functions; assume that all the other functions apart from initdata are both processor and memory intensive.

So my question is: why is using multithreading in this way more costly in memory and speed than the traditional method of not using threads at all?

I MUST REITERATE THAT THE CODE IS PSEUDO CODE AND THE PROBLEM ISN'T WHETHER IT WILL COMPILE.

Thanks, I would appreciate any suggestions you might have for pthreads and/or boost solutions.

s kettle
2 Answers


Each thread requires its own call-stack, which consumes memory. Every local variable of your function (and of every other function on the call-stack) counts towards that memory.

When creating a new thread, space for its call-stack is reserved. I don't know what the default value is for pthreads, but you might want to look into that. If you know you require less stack space than is reserved by default, you might be able to reduce memory consumption significantly by explicitly specifying the desired stack size when spawning the thread.

As for the performance part, it could depend on several issues. Generally, you should not expect a performance boost from parallelizing number-crunching operations onto more threads than you have cores (I don't know if that is the case here). This might end up being slower, due to the additional overhead of context switches, an increased number of cache misses, etc. There are ways to profile this, depending on your platform (for instance, the Visual Studio profiler can count cache misses, and there are tools for Linux as well).

Björn Pollex
  • It uses the same amount of memory as the for loop and yet takes about twice the time. Thanks, I have found a link for setting stack size [link](http://stackoverflow.com/questions/7235934/pthreads-high-memory-usage) – s kettle Nov 28 '12 at 15:54
  • Edited my answer to include runtime issues. These are all guesses though, maybe something in there helps you. – Björn Pollex Nov 28 '12 at 16:02

Creating a thread is quite an expensive operation. If each thread only does a very small amount of work, then your program may be dominated by the time taken to create them all. Also, a large number of active threads can increase the work needed to schedule them, degrading system performance. And, as another answer mentions, each thread requires its own stack memory, so memory usage will grow with the number of threads.

Another issue can be cache invalidation; when one thread writes its results into its entity structure, it may invalidate nearby cached memory and force other threads to fetch data from higher-level caches or main memory.

You may have better results if you use a smaller number of threads, each processing a larger subset of the data. For a CPU-bound task like this, one thread per CPU is probably best - that means that you can keep all CPUs busy, and there's no need to waste time scheduling multiple threads on each. Make sure that the entities each thread works on are located together in the array, so it does not invalidate those cached by other threads.

Mike Seymour
  • I wanted to use as many threads as possible to do a costly function and thereby save on speed and memory. However, this seems not to be possible. I did notice that when using boost::thread the memory stayed the same for x number of threads, whereas it increased with each additional iteration of the for-loop alternative. I suppose there may be a way of optimizing threads, but for this purpose perhaps it would require some advanced memory allocation. – s kettle Nov 28 '12 at 18:27
  • If memory is already allocated to the entity array, then how does running threads on it change this? Also, regarding CPUs: should I worry about the number of threads matching the number of CPUs? After all, multithreading is older than dual core, e.g. Apache. I think you're right that it is the memory/CPU used to set up the threads themselves that doesn't save time or memory in the long run compared to a for-loop, especially for many threads. – s kettle Nov 28 '12 at 18:32
  • Using "as many threads as possible" will only save time if all the threads can run concurrently. Once there are more active threads than there are CPUs, some of them must wait and you get no further speed improvement - and may lose some to the scheduling costs. – Mike Seymour Nov 29 '12 at 11:20
  • "multithreading is older than dual core E.g apache" - multithreading has other uses besides speeding up a CPU-bound program. In the case of an I/O-bound server, it allows you to simplify the programming logic by using blocking I/O operations (running one thread per connection). But if you want to improve the speed of a CPU-bound task, then you can't do better than running it concurrently on all CPUs, so having more threads than that won't help. (Also, multi-core servers have been around for a very long time, even if the technology has only recently become cheap enough for the desktop.) – Mike Seymour Nov 29 '12 at 11:25
  • "If memory is already allocated to the entity array then how does running threads on them change this?" I didn't say it would. My only comments about memory were (a) cache invalidation might hurt performance if two threads access nearby data, and (b) each thread will need some memory for its stack. – Mike Seymour Nov 29 '12 at 11:27