
I have a dual-socket Xeon E5522 2.26 GHz machine (with hyperthreading disabled) running Ubuntu Server on Linux kernel 3.0 with NUMA support. The layout is 4 physical cores per socket. An OpenMP application runs on this machine and I have the following questions:

  1. Does an OpenMP program automatically take advantage of a NUMA machine with a NUMA-aware kernel (i.e. is a thread and its private data kept on the same NUMA node throughout execution)? If not, what can be done?

  2. What about NUMA and per-thread private C++ STL data structures?

labotsirc
  • Please define what kind of advantage you mean by "advantage when running on a NUMA machine". OpenMP is currently not NUMA-aware, but OpenMP 4.0 will likely bring provisions for improved thread binding. – Hristo Iliev Aug 15 '12 at 12:29
  • I updated the question; it is mainly what you pointed out. What about `taskset`? Will it help to bind the threads so that each thread's private data is kept local? – labotsirc Aug 15 '12 at 16:10

2 Answers


The current OpenMP standard defines a boolean environment variable `OMP_PROC_BIND` that controls the binding of OpenMP threads. If set to true, e.g.

shell$ OMP_PROC_BIND=true OMP_NUM_THREADS=12 ./app.x

then the OpenMP execution environment should not move threads between processors. Unfortunately nothing more is said about how those threads should be bound, and that is what a special working group in the OpenMP language committee is addressing right now. OpenMP 4.0 will come with new environment variables and clauses that will allow one to specify how to distribute the threads. Of course, many OpenMP implementations offer their own non-standard methods to control binding.
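For example, two common vendor-specific binding mechanisms look like this (a sketch: `GOMP_CPU_AFFINITY` belongs to GCC's libgomp and `KMP_AFFINITY` to Intel's runtime, and the exact syntax varies between runtime versions):

```shell
# GCC (libgomp): bind the i-th OpenMP thread to the i-th CPU in the list
shell$ GOMP_CPU_AFFINITY="0-7" OMP_NUM_THREADS=8 ./app.x

# Intel runtime: spread threads across the sockets
shell$ KMP_AFFINITY=scatter OMP_NUM_THREADS=8 ./app.x
```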

Still, most OpenMP runtimes are not NUMA-aware. They will happily dispatch threads to any available CPU and you have to make sure that each thread only accesses data that belongs to it. Some general hints in this direction:

  • Do not use dynamic scheduling for parallel for (C/C++) / DO (Fortran) loops.
  • Try to initialise the data in the same thread that will later use it. If you run two separate parallel for loops with the same team size and the same number of iteration chunks, then with static scheduling chunk 0 of both loops will be executed by thread 0, chunk 1 by thread 1, and so on.
  • If using OpenMP tasks, try to initialise the data in the task body, because most OpenMP runtimes implement task stealing - idle threads can steal tasks from other threads' task queues.
  • Use a NUMA-aware memory allocator.

Some colleagues of mine have thoroughly evaluated the NUMA behaviour of different OpenMP runtimes and have specifically looked into the NUMA awareness of Intel's implementation, but the articles are not published yet so I cannot provide you with a link.

There is a research project called ForestGOMP which aims at providing a NUMA-aware drop-in replacement for libgomp. Maybe you should give it a look.

Hristo Iliev
  • very useful tips, thanks. As for your last tip: what happens in the case of the C++ object allocator, e.g. `object *m = new object(arg);`? Is there an equivalent of tcmalloc (which covers C) for C++? – labotsirc Aug 16 '12 at 15:24
  • `tcmalloc` supports both C and C++. As for the C++ `new` operator, you can use the placement syntax to put the object into memory previously allocated by `tcmalloc`, or rely on `tcmalloc` replacing the standard `malloc()` call by preloading (as `new` is technically a wrapper around `malloc`). – Hristo Iliev Aug 16 '12 at 16:31
  • thanks Hristo, tcmalloc preloading + taskset is helping the application scale within the same socket; however, when I start using additional cores from the second socket (without taskset, letting OpenMP and the OS handle thread binding) the speedup does not improve or can even be worse than using one socket. Under these circumstances I could (1) wait for version 4.0 or (2) drop OpenMP and use pthreads, but I'm not sure the second option will really let me solve the NUMA-awareness problem; I have to investigate that – labotsirc Aug 19 '12 at 21:25
  • Hmm, is tcmalloc numa aware? – Amos Mar 07 '18 at 12:44
  • @Amos, vanilla `tcmalloc` is not NUMA-aware. There was a modified version from AMD, but the source code seems to no longer be available. – Hristo Iliev Mar 07 '18 at 13:25

You can also check that your memory placement and accesses are done the right way with NUMAPROF, a tool for profiling NUMA applications that is now open-source for Linux: https://memtt.github.io/numaprof/.