
Briefly, my problem:

I have a computer with two AMD Opteron 6272 sockets and 64 GB of RAM.

I run one multithreaded program on all 32 cores and it is 15% slower than when I run two programs, each on one 16-core socket.

How do I make the one-program version as fast as the two-program one?


More details:

I have a large number of tasks and want to fully load all 32 cores of the system. So I pack the tasks into groups of 1000. Such a group needs about 120 MB of input data and takes about 10 seconds to complete on one core. To make the test ideal, I copy this group 32 times and distribute the tasks between the 32 cores using Intel TBB's parallel_for loop.

I use pthread_setaffinity_np to ensure that the system does not make my threads jump between cores, and that all cores are used consecutively.

I use mlockall(MCL_FUTURE) to ensure that the system does not make my memory jump between sockets.

So the code looks like this:

  void operator()(const blocked_range<size_t> &range) const
  {
    for (size_t i = range.begin(); i != range.end(); ++i) {

      // Pin the calling TBB worker thread to the core assigned to index i,
      // so the scheduler cannot migrate it between cores (or sockets).
      cpu_set_t cpuset;
      CPU_ZERO(&cpuset);
      CPU_SET(threadNumberToCpuMap[i], &cpuset);
      int s = pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &cpuset);
      if (s != 0)
        fprintf(stderr, "pthread_setaffinity_np failed: %d\n", s);

      // Lock virtual memory so pages stay at the physical addresses
      // where they were allocated.
      mlockall(MCL_FUTURE);

      TaskManager manager;
      for (int j = 0; j < fNTasksPerThr; j++) {
        manager.SetData(&(InpData->fInput[j]));
        manager.Run();
      }
    }
  }
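For reference, a functor like this is driven by a parallel_for call, which the question does not show. A sketch, assuming the functor class is named ComputeBody (a placeholder; the real code may differ): a grainsize of 1 combined with simple_partitioner guarantees each operator() invocation receives exactly one index, which the per-index pinning above relies on.

  #include "tbb/parallel_for.h"
  #include "tbb/blocked_range.h"

  // Hypothetical driver; ComputeBody stands for the functor shown above.
  tbb::parallel_for(tbb::blocked_range<size_t>(0, 32, /*grainsize=*/1),
                    ComputeBody(/* ... */),
                    tbb::simple_partitioner()); // exactly one index per call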

Only the computing time is important to me, therefore I prepare the input data in a separate parallel_for loop and do not include the preparation time in the measurements.

  void operator()(const blocked_range<size_t> &range) const
  {
    for (size_t i = range.begin(); i != range.end(); ++i) {

      // Same pinning as in the compute loop, so each thread allocates and
      // first-touches its own input copy on the core that will process it.
      cpu_set_t cpuset;
      CPU_ZERO(&cpuset);
      CPU_SET(threadNumberToCpuMap[i], &cpuset);
      int s = pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &cpuset);
      if (s != 0)
        fprintf(stderr, "pthread_setaffinity_np failed: %d\n", s);

      // Lock virtual memory so pages stay at the physical addresses
      // where they were allocated.
      mlockall(MCL_FUTURE);

      // Give every thread its own copy of the input data.
      InpData[i].fInput = new ProgramInputData[fNTasksPerThr];
      for (int j = 0; j < fNTasksPerThr; j++) {
        InpData[i].fInput[j] = InpDataPerThread.fInput[j];
      }
    }
  }

Now I run all of this on 32 cores and see a speed of ~1600 tasks per second.

Then I create two versions of the program and, with taskset and pthread, ensure that the first runs on the 16 cores of the first socket and the second on the second socket. I run them next to each other using a simple & in the shell:

  program1 & program2 &
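(If the pinning were done entirely from the shell, the launch could look like the sketch below. The core ranges assume CPUs 0-15 sit on socket 0 and CPUs 16-31 on socket 1; numactl --hardware shows the actual mapping on a given machine.)

  taskset -c 0-15  ./program1 &
  taskset -c 16-31 ./program2 &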

Each of these programs achieves a speed of ~900 tasks/s. In total that is >1800 tasks/s, which is 15% more than the one-program version.

What do I miss?

I suspect the problem may be in the libraries, which I load into the memory of the master thread only. Can this be a problem? Can I copy the library data so that it is available independently on both sockets?

klm123
  • Have you tried 32 single threaded programs? – BЈовић Nov 13 '13 at 09:42
  • 32 single-threaded programs wouldn't deal with the issue, which is likely memory allocation in the wrong NUMA node. He only has 2 nodes, so he only needs 2 programs with each tied to a single node. – Len Holgate Nov 13 '13 at 09:43
  • Numa node?? I have no idea what that is, but it sounds so good that I am going to find out. – Dennis Nov 13 '13 at 10:44
  • @Dennis, Yes, numa. If you want I can show the topology and scalability tests. – klm123 Nov 13 '13 at 11:05
  • Interesting stuff. I hadn't had to look into it before, but I can see how much benefit it could have in parallel processing of large data sets. – Dennis Nov 13 '13 at 17:23
  • How much overall memory does your system have? 32 groups of 120MB data each are ~4GB in the same virtual space. Maybe you're killing your page maps? – Leeor Nov 13 '13 at 19:56
  • @Leeor, killing? Then I would have a crash? It has 64GB. – klm123 Nov 13 '13 at 20:14
  • I meant stressing, but I guess it shouldn't exceed 64GB. Still, the virtual address space is probably one of the biggest difference between the two approaches, maybe something related is causing this impact. – Leeor Nov 13 '13 at 20:32
  • Instead of trying to guess the possible reason, have you ever considered running your executable with a performance tool and measuring if it is really remote memory accesses that impact the performance? I would suggest [likwid-perfctr](http://code.google.com/p/likwid/wiki/LikwidPerfCtr) from the LIKWID toolset. Something like `likwid-perfctr -g MEM ./executable` should suffice. – Hristo Iliev Nov 14 '13 at 09:29
  • Hristo - now, what I need is a version of that tool that runs on Windows... – Len Holgate Nov 28 '13 at 09:04

2 Answers


I would guess that it's STL/boost memory allocation that's spreading memory for your collections etc. across NUMA nodes, because they're not NUMA-aware and you have threads in the program running on each node.

Custom allocators for all of the STL/boost things that you use might help (but is likely a huge job).
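For illustration, here is a minimal sketch of such an allocator (not code from this answer), built on libnuma's numa_alloc_local() so that a container's pages land on the NUMA node of the thread that allocates them. It assumes a standard library with full C++11 minimal-allocator support and linking with -lnuma:

  #include <numa.h>    // numa_alloc_local / numa_free
  #include <cstddef>
  #include <new>

  // Memory comes from the NUMA node the calling thread runs on, so
  // containers filled by a pinned thread stay local to its socket.
  template <typename T>
  struct NumaLocalAllocator {
      typedef T value_type;

      NumaLocalAllocator() {}
      template <typename U>
      NumaLocalAllocator(const NumaLocalAllocator<U>&) {}

      T* allocate(std::size_t n) {
          void* p = numa_alloc_local(n * sizeof(T));
          if (!p) throw std::bad_alloc();
          return static_cast<T*>(p);
      }
      void deallocate(T* p, std::size_t n) {
          numa_free(p, n * sizeof(T));
      }
  };

  template <typename T, typename U>
  bool operator==(const NumaLocalAllocator<T>&, const NumaLocalAllocator<U>&) { return true; }
  template <typename T, typename U>
  bool operator!=(const NumaLocalAllocator<T>&, const NumaLocalAllocator<U>&) { return false; }

  // usage: std::vector<ProgramInputData, NumaLocalAllocator<ProgramInputData> > data;

Note that numa_alloc_local() rounds up to whole pages, so this only pays off for large allocations such as the 120 MB input blocks, not for many small ones.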

Len Holgate
  • Shouldn't mlockall(MCL_FUTURE) help? This http://linux.die.net/man/2/mlock says that it should apply to all future memory allocations. – klm123 Nov 13 '13 at 09:49
  • I expect that the libs allocate memory before that, and also, since they know nothing of NUMA, they likely reuse memory between collections or internally. Using custom containers is probably the way to go IMHO. – Len Holgate Nov 13 '13 at 10:03
  • You seem to be right. Minimising usage of std::vector::reserve I managed to reduce the time difference to 2%. – klm123 Nov 15 '13 at 12:27

You might be suffering from a bad case of cache-line false sharing: http://en.wikipedia.org/wiki/False_sharing

Your threads probably share access to the same data structure through the blocked_range reference. If speed is all you need, you might want to pass a copy to each thread. If your data is too large to fit on the call stack, you could dynamically allocate a copy for each range on separate cache lines (i.e., just make sure the copies are far enough apart).
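For illustration (a generic sketch, not code from the question), the usual fix for false sharing is to pad or align each thread's slot to the 64-byte cache-line size so that no two threads ever write the same line:

  // One padded slot per thread; sizeof(PaddedResult) == 64, so
  // neighbouring slots can never share a cache line.
  struct PaddedResult {
      double value;
      char   pad[64 - sizeof(double)];
  };

  PaddedResult results[32]; // slot i is written only by thread i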

Or maybe I need to see the rest of the code to better understand what you are doing.

  • I am not sure I understand you. What data are you talking about? blocked_range is a very small structure and it is not used inside the program (TaskManager). All the data that is used I have already copied and dynamically allocated. – klm123 Nov 13 '13 at 09:58
  • Ok. You are right. I misunderstood the purpose and nature of blocked_range. I thought it was some common data you were manipulating. I now understand it is a TBB template for integer intervals. My bad. How is InpData defined? – Robert Jørgensgaard Engdahl Nov 13 '13 at 10:26