
I have written this test application: it iterates from 0 to 9999 and, for each integer in the range, computes some useless but calculation-intensive function. The program outputs the sum of the function values. To make it run on several threads I'm using InterlockedIncrement: if the iteration number after the increment is < 10000, the thread processes that iteration; otherwise it terminates.

I am wondering why it does not scale as well as I would like. With 5 threads it runs in 8s versus 36s with a single thread, which is only a ~4.5x speedup. In my experiments with OpenMP (on slightly different problems) I was getting much better scalability.

The source code is shown below.

I am running Windows7 OS on a Phenom II X6 desktop. Don't know what other parameters might be relevant.

Could you please help me understand this sub-optimal scalability? Many thanks.

#include <boost/thread.hpp>
#include <boost/shared_ptr.hpp>
#include <boost/make_shared.hpp>
#include <vector>
#include <windows.h>
#include <iostream>
#include <cmath>

using namespace std;
using namespace boost;

struct sThreadData
{
  sThreadData() : iterCount(0), value( 0.0 ) {}
  unsigned iterCount;
  double value;
};

// Shared iteration counter: each thread atomically claims the next index.
volatile LONG g_globalCounter;
const LONG g_maxIter = 10000;

void ThreadProc( shared_ptr<sThreadData> data )
{
  double threadValue = 0.0;
  unsigned threadCount = 0;

  while( true )
  {
    // Note: InterlockedIncrement returns the incremented value, so the
    // first index claimed is 1 and indices 1..9999 are processed.
    LONG iterIndex = InterlockedIncrement( &g_globalCounter );
    if( iterIndex >= g_maxIter )
      break;

    ++threadCount;

    // Useless but calculation-intensive work for this iteration.
    double value = iterIndex * 0.12345777;
    for( unsigned i = 0; i < 100000; ++i )
      value = sqrt( value * log(1.0 + value) );

    threadValue += value;
  }

  data->value = threadValue;
  data->iterCount = threadCount;
}

int main()
{
  const unsigned threadCount = 1; // the quoted timings compare 1 and 5 threads

  vector< shared_ptr<sThreadData> > threadData;
  for( unsigned i = 0; i < threadCount; ++i )
    threadData.push_back( make_shared<sThreadData>() );

  g_globalCounter = 0;

  DWORD t1 = GetTickCount();
  vector< shared_ptr<thread> > threads;
  for( unsigned i = 0; i < threadCount; ++i )
    threads.push_back( make_shared<thread>( &ThreadProc, threadData[i] ) );

  double sum = 0.0;
  for( unsigned i = 0; i < threadData.size(); ++i )
  {
    threads[i]->join();
    sum += threadData[i]->value;
  }

  DWORD t2 = GetTickCount();
  cout << "T=" << static_cast<double>(t2 - t1) / 1000.0 << "s\n";

  cout << "Sum= " << sum << "\n";
  for( unsigned i = 0; i < threadData.size(); ++i )
    cout << threadData[i]->iterCount << "\n";

  return 0;
}

Edit: attaching sample output of this test program (1 and 5 threads): [screenshot of sample output]

Alexander Chertov
  • Have you tried splitting the tasks beforehand instead of having the threads access shared state? – Jonas Schäfer Oct 11 '12 at 11:45
  • Thanks for reading this. By shared state do you mean the `g_globalCounter` variable? No, I haven't tried that. My assumption was that first-come-first-serve would give optimal load balancing. I have tried increasing the number of `value = sqrt( value * log(1.0 + value) );` iterations tenfold (this should reduce contention on the iteration counter). The results were 80.43s versus 358s, so I don't think the shared state is causing this. – Alexander Chertov Oct 11 '12 at 11:51
  • OK, false-sharing on the sThreadData instances? – Martin James Oct 11 '12 at 12:13
  • Those are only modified at the end of ThreadProc... – Alexander Chertov Oct 11 '12 at 12:15
  • Most notably, each thread has its own sThreadData instance. – Jonas Schäfer Oct 11 '12 at 12:15
  • @AlexanderChertov - OK, yes. Too many things called 'value' :) – Martin James Oct 11 '12 at 12:29
  • Next bad guess: how are sqrt/log implemented? FPU contention, maybe? – Martin James Oct 11 '12 at 12:31
  • Hard to pick a name for a meaningless variable... Will try with an integer function and let you know if the results are different. – Alexander Chertov Oct 11 '12 at 12:35
  • You would absolutely expect any type of synchronisation (such as `InterlockedIncrement()`) to reduce performance -- I'm surprised it only goes down 10%, given how often you're calling it! To test whether this is the cause, please try running it without any synchronisation, simply sending `g_maxIter / threadCount` iterations to each thread and letting them run independently of each other. I'm 99% sure you'll get a big speedup. – j_random_hacker Oct 11 '12 at 13:00
  • @j_random_hacker, I have changed the function interface to `void ThreadProc( shared_ptr<sThreadData> data, unsigned iStart, unsigned iEnd )` (see the sketch after this comment thread). No speedup, same old 36s vs 8s. – Alexander Chertov Oct 11 '12 at 13:09
  • @MartinJames, it's not FPU contention either. I have changed the function loop to do `result = result * (123 + result);` - the speedup ratio is the same ~4.5. – Alexander Chertov Oct 11 '12 at 13:16
  • Thanks (I assume you got rid of the `InterlockedIncrement()` too). I think it was worthwhile eliminating that as a possible cause. But in that case, I'm baffled! AFAIK each CPU has its own FPU (and SSE registers), so I can't see how there would be FP contention as Martin suggested. Do you have other programs running in the background? If you start 5 simultaneous instances of a single-threaded program that only does `g_maxIter / 5` iterations, do they each take longer than if you only start 1? – j_random_hacker Oct 11 '12 at 13:18
  • :(( The bright side is that I have run out of bad guesses. Maybe someone else can come up with a good one? If you have six cores, why are you not creating six threads? – Martin James Oct 11 '12 at 13:21
  • Also, what sort of scalability were you getting with OpenMP? – Martin James Oct 11 '12 at 13:23
  • @j_random_hacker, running 5 instances of the single-threaded app is slower than running just one. This doesn't tell me much, do you have any ideas? – Alexander Chertov Oct 11 '12 at 13:41
  • While it doesn't tell us exactly what the problem is, it does tell us that it has nothing to do with shared state. – j_random_hacker Oct 11 '12 at 13:43
  • @MartinJames, with OpenMP I was getting the same numbers (8s vs 36s). I guess it was on another PC and for another problem that I was getting better scaling. I have even turned off Cool'n'Quiet in the BIOS... I use 5 threads rather than 6 to reduce the impact of other tasks running in the background. – Alexander Chertov Oct 11 '12 at 13:45
  • @AlexanderChertov err.. 'other tasks'? If you're running a benchmark, maybe you should not do that. – Martin James Oct 11 '12 at 13:51
  • @MartinJames, by those I mean the browser, IMs, stuff like that. But I will definitely try closing everything. – Alexander Chertov Oct 11 '12 at 13:57
  • I'm confused -- in your original post you said you were getting "much better scalability" with OpenMP, now you say you're getting "the same numbers"...? – j_random_hacker Oct 11 '12 at 14:03
  • @j_random_hacker, my apologies. I was comparing with my previous experience with OpenMP - other problems on other PCs. This particular problem on this PC exhibits 4.5x scaling whether I use OpenMP or not. I will run it on my PC at home and update the snippet to include the OpenMP code later. – Alexander Chertov Oct 11 '12 at 14:21
  • I see, thanks. In a way that's a good thing -- it means it's nothing particular to your coding strategy (and presumably your posted non-OpenMP code would get close to 5x speedup on those other computers). Wish I could think of a potential cause for the slowdown but I can't -- it's not even like there would be any significant bus contention since the threads are spending all their time doing CPU (well, FPU) work! – j_random_hacker Oct 11 '12 at 14:35
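
A minimal sketch of the statically partitioned variant discussed in the comments above, assuming the `iStart`/`iEnd` interface from Alexander's comment and reusing sThreadData, g_maxIter and the work loop from the question's listing (the exact code was not posted):

void ThreadProcStatic( shared_ptr<sThreadData> data, unsigned iStart, unsigned iEnd )
{
  double threadValue = 0.0;

  // Each thread owns the half-open range [iStart, iEnd);
  // there is no shared counter and hence no synchronisation at all.
  for( unsigned iterIndex = iStart; iterIndex < iEnd; ++iterIndex )
  {
    double value = iterIndex * 0.12345777;
    for( unsigned i = 0; i < 100000; ++i )
      value = sqrt( value * log(1.0 + value) );

    threadValue += value;
  }

  data->value = threadValue;
  data->iterCount = iEnd - iStart;
}

// In main(), each thread would get an equal slice of [0, g_maxIter):
//   const unsigned chunk = g_maxIter / threadCount;
//   threads.push_back( make_shared<thread>( &ThreadProcStatic, threadData[i],
//                                           i * chunk, ( i + 1 ) * chunk ) );

Per the comments, this variant shows the same 36s vs 8s timings, which rules out contention on the shared counter as the cause.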

1 Answer


It turned out that the results can be explained by the fact that my CPU supports the AMD Turbo Core technology.

"While in Turbo CORE mode, the AMD Phenom™ II X6 1090T shifts frequency speed from 3.2GHz on six cores, to 3.6GHz on three cores."
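
Taking the quoted clocks at face value, this accounts for the numbers almost exactly: five cores at 3.2GHz versus one core at 3.6GHz gives an expected speedup of 5 × 3.2/3.6 ≈ 4.44, very close to the observed 36s / 8s = 4.5.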

So the clock frequencies were not the same in single-threaded and multi-threaded mode. I was used to playing around with multithreading on CPUs that don't support Turbo Core. Below is an image that shows:

  • the AMD OverDrive utility window (a tool that allows toggling Turbo Core on/off)
  • a run with 1 thread with Turbo Core ON
  • a run with 1 thread with Turbo Core OFF
  • a run with 5 threads

[screenshot of the results]

Many thanks to the people who tried to help.

Alexander Chertov