
I have written this test application: it iterates from 0 to 9999 and, for each integer in the range, computes some useless but calculation-intensive function. The program outputs the sum of the function values. To make it run on several threads I'm using InterlockedIncrement: if the iteration number after the increment is < 10000, the thread processes that iteration; otherwise it terminates.

I am wondering why it does not scale as well as I would like. With 5 threads it runs in 8s versus 36s with a single thread, which is only a ~4.5x speedup. In my experiments with OpenMP (on slightly different problems) I was getting much better scalability.

The source code is shown below.

I am running Windows7 OS on a Phenom II X6 desktop. Don't know what other parameters might be relevant.

Could you please help me understand this sub-optimal scalability? Many thanks.

#include <boost/thread.hpp>
#include <boost/shared_ptr.hpp>
#include <boost/make_shared.hpp>
#include <vector>
#include <windows.h>
#include <iostream>
#include <cmath>

using namespace std;
using namespace boost;

struct sThreadData
{
  sThreadData() : iterCount(0), value( 0.0 ) {}
  unsigned iterCount;
  double value;
};

// Shared iteration counter: each thread atomically claims the next index.
volatile LONG g_globalCounter;
const LONG g_maxIter = 10000;

void ThreadProc( shared_ptr<sThreadData> data )
{
  double threadValue = 0.0;
  unsigned threadCount = 0;

  while( true )
  {
    // Note: InterlockedIncrement returns the incremented value, so the
    // first index claimed is 1 and indices 1..9999 are processed.
    LONG iterIndex = InterlockedIncrement( &g_globalCounter );
    if( iterIndex >= g_maxIter )
      break;

    ++threadCount;

    // Useless but calculation-intensive work for this iteration.
    double value = iterIndex * 0.12345777;
    for( unsigned i = 0; i < 100000; ++i )
      value = sqrt( value * log(1.0 + value) );

    threadValue += value;
  }

  data->value = threadValue;
  data->iterCount = threadCount;
}

int main()
{
  const unsigned threadCount = 1; // the quoted timings compare 1 and 5 threads

  vector< shared_ptr<sThreadData> > threadData;
  for( unsigned i = 0; i < threadCount; ++i )
    threadData.push_back( make_shared<sThreadData>() );

  g_globalCounter = 0;

  DWORD t1 = GetTickCount();
  vector< shared_ptr<thread> > threads;
  for( unsigned i = 0; i < threadCount; ++i )
    threads.push_back( make_shared<thread>( &ThreadProc, threadData[i] ) );

  double sum = 0.0;
  for( unsigned i = 0; i < threadData.size(); ++i )
  {
    threads[i]->join();
    sum += threadData[i]->value;
  }

  DWORD t2 = GetTickCount();
  cout << "T=" << static_cast<double>(t2 - t1) / 1000.0 << "s\n";

  cout << "Sum= " << sum << "\n";
  for( unsigned i = 0; i < threadData.size(); ++i )
    cout << threadData[i]->iterCount << "\n";

  return 0;
}

Edit: attaching sample output of this test program (1 and 5 threads): [screenshot of sample output]

Alexander Chertov
  • Have you tried splitting the tasks beforehand instead of having the threads access shared state? – Jonas Schäfer Oct 11 '12 at 11:45
  • Thanks for reading this. By shared state do you mean the `g_globalCounter` variable? No, I haven't tried that. My assumption was that first-come-first-serve would give optimal load balancing. I have tried increasing the number of `value = sqrt( value * log(1.0 + value) );` iterations tenfold (this should reduce contention on the iteration counter). The results were 80.43s versus 358s, so I don't think the shared state is causing this. – Alexander Chertov Oct 11 '12 at 11:51
  • OK, false-sharing on the sThreadData instances? – Martin James Oct 11 '12 at 12:13
  • Those are only modified at the end of ThreadProc... – Alexander Chertov Oct 11 '12 at 12:15
  • Most notably, each thread has its own sThreadData instance. – Jonas Schäfer Oct 11 '12 at 12:15
  • @AlexanderChertov - OK, yes. Too many things called 'value' :) – Martin James Oct 11 '12 at 12:29
  • Next bad guess: how are sqrt/log implemented? FPU contention, maybe? – Martin James Oct 11 '12 at 12:31
  • Hard to pick a name for a meaningless variable... Will try with an integer function and let you know if the results are different. – Alexander Chertov Oct 11 '12 at 12:35
  • You would absolutely expect any type of synchronisation (such as `InterlockedIncrement()`) to reduce performance -- I'm surprised it only goes down 10%, given how often you're calling it! To test whether this is the cause, please try running it without any synchronisation, simply sending `g_maxIter / threadCount` iterations to each thread and letting them run independently of each other. I'm 99% sure you'll get a big speedup. – j_random_hacker Oct 11 '12 at 13:00
  • @j_random_hacker, I have changed the function interface to `void ThreadProc( shared_ptr<sThreadData> data, unsigned iStart, unsigned iEnd )` (see the sketch after this comment thread). No speedup, same old 36s vs 8s. – Alexander Chertov Oct 11 '12 at 13:09
  • @MartinJames, it's not FPU contention either. I have changed the function loop to do `result = result * (123 + result);` - the speedup ratio is the same ~4.5. – Alexander Chertov Oct 11 '12 at 13:16
  • Thanks (I assume you got rid of the `InterlockedIncrement()` too). I think it was worthwhile eliminating that as a possible cause. But in that case, I'm baffled! AFAIK each CPU has its own FPU (and SSE registers), so I can't see how there would be FP contention as Martin suggested. Do you have other programs running in the background? If you start 5 simultaneous instances of a single-threaded program that only does `g_maxIter / 5` iterations, do they each take longer than if you only start 1? – j_random_hacker Oct 11 '12 at 13:18
  • :(( The bright side is that I have run out of bad guesses. Maybe someone else can come up with a good one? If you have six cores, why are you not creating six threads? – Martin James Oct 11 '12 at 13:21
  • Also, what sort of scalability were you getting with OpenMP? – Martin James Oct 11 '12 at 13:23
  • @j_random_hacker, running 5 instances of the single-threaded app is slower than running just one. This doesn't tell me much, do you have any ideas? – Alexander Chertov Oct 11 '12 at 13:41
  • While it doesn't tell us exactly what the problem is, it does tell us that it has nothing to do with shared state. – j_random_hacker Oct 11 '12 at 13:43
  • @MartinJames, with OpenMP I was getting the same numbers (8s vs 36s). I guess it was on another PC and for another problem that I was getting better scaling. I have even turned off Cool'n'Quiet in the BIOS... I use 5 threads rather than 6 to reduce the impact of other tasks running in the background. – Alexander Chertov Oct 11 '12 at 13:45
  • @AlexanderChertov err.. 'other tasks'? If you're running a benchmark, maybe you should not do that. – Martin James Oct 11 '12 at 13:51
  • @MartinJames, by those I mean the browser, IMs, stuff like that. But I will definitely try closing everything. – Alexander Chertov Oct 11 '12 at 13:57
  • I'm confused -- in your original post you said you were getting "much better scalability" with OpenMP, now you say you're getting "the same numbers"...? – j_random_hacker Oct 11 '12 at 14:03
  • @j_random_hacker, my apologies. I was comparing with my previous experience with OpenMP - other problems on other PCs. This particular problem on this PC exhibits 4.5x scaling whether I use OpenMP or not. I will run it on my PC at home and update the snippet to include the OpenMP code later. – Alexander Chertov Oct 11 '12 at 14:21
  • I see, thanks. In a way that's a good thing -- it means it's nothing particular to your coding strategy (and presumably your posted non-OpenMP code would get close to 5x speedup on those other computers). Wish I could think of a potential cause for the slowdown but I can't -- it's not even like there would be any significant bus contention since the threads are spending all their time doing CPU (well, FPU) work! – j_random_hacker Oct 11 '12 at 14:35
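
A minimal sketch of the statically partitioned variant discussed in the comments above, assuming the `iStart`/`iEnd` interface from Alexander's comment and reusing sThreadData, g_maxIter and the work loop from the question's listing (the exact code was not posted):

void ThreadProcStatic( shared_ptr<sThreadData> data, unsigned iStart, unsigned iEnd )
{
  double threadValue = 0.0;

  // Each thread owns the half-open range [iStart, iEnd);
  // there is no shared counter and hence no synchronisation at all.
  for( unsigned iterIndex = iStart; iterIndex < iEnd; ++iterIndex )
  {
    double value = iterIndex * 0.12345777;
    for( unsigned i = 0; i < 100000; ++i )
      value = sqrt( value * log(1.0 + value) );

    threadValue += value;
  }

  data->value = threadValue;
  data->iterCount = iEnd - iStart;
}

// In main(), each thread would get an equal slice of [0, g_maxIter):
//   const unsigned chunk = g_maxIter / threadCount;
//   threads.push_back( make_shared<thread>( &ThreadProcStatic, threadData[i],
//                                           i * chunk, ( i + 1 ) * chunk ) );

Per the comments, this variant shows the same 36s vs 8s timings, which rules out contention on the shared counter as the cause.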

1 Answer


It turned out that the results can be explained by the fact that my CPU supports the AMD Turbo Core technology.

"While in Turbo CORE mode, the AMD Phenom™ II X6 1090T shifts frequency speed from 3.2GHz on six cores, to 3.6GHz on three cores."
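
Taking the quoted clocks at face value, this accounts for the numbers almost exactly: five cores at 3.2GHz versus one core at 3.6GHz gives an expected speedup of 5 × 3.2/3.6 ≈ 4.44, very close to the observed 36s / 8s = 4.5.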

So the clock frequencies were not the same in single-threaded and multi-threaded mode. I was used to playing around with multithreading on CPUs that don't support Turbo Core. Below is an image that shows:

  • the AMD OverDrive utility window (a tool that allows toggling Turbo Core on/off)
  • a run with 1 thread with Turbo Core ON
  • a run with 1 thread with Turbo Core OFF
  • a run with 5 threads

[screenshot of the results]

Many thanks to the people who tried to help.

Alexander Chertov