
In this application I have groups of N (POSIX) threads. The first group starts up, creates an object A, and winds down. A little bit later a new group with N threads starts up, uses A to create a similar object B, and winds down. This pattern is repeated. The application is highly memory-intensive (A and B have a large number of malloc'ed arrays). I would like local access to memory as much as possible. I can use `numactl --localalloc` to achieve this, but in order for this to work I also need to make sure that those threads from the first and second group that work on the same data are bound to the same NUMA node. I've looked into `sched_setaffinity`, but wonder if better approaches exist.

The logic of the application is such that a solution with no separate thread groups would tear apart the program logic. That is, a solution where a single group of threads manages first object A and later object B (without winding down in between) would be extremely contrived and would obliterate the object-oriented layout of the code.

micans

1 Answer


Binding threads in group B to the same cores that they ran on in group A is more restrictive than what you need. Modern processors have dedicated level 1 (L1) and level 2 (L2) caches per core, so binding threads to a specific core only makes sense to reach data that is still "hot" in those caches. What you probably meant is binding the group B threads to the same NUMA node as the group A threads, so that the large arrays stay in that node's local memory.

That said, you have two choices:

  1. You set the affinity of group A to a specific NUMA node, and use that same node when setting the affinity of group B, or
  2. You find out which NUMA node your malloc'ed arrays are on and set the affinity of group B to that node.

Option (1) is relatively easy, so let's talk about how to implement option (2).

The following SO answer describes how to find out, given a virtual address in your process, which NUMA node holds that memory:

Can I get the NUMA node from a pointer address (in C on Linux)?

There is a move_pages function in -lnuma: http://linux.die.net/man/2/move_pages which can report the current state of address (page) to node mappings:

nodes can also be NULL, in which case move_pages() does not move any pages but instead will return the node where each page currently resides, in the status array. Obtaining the status of each page may be necessary to determine pages that need to be moved.

Armed with that information, you want to set the affinity of your group B threads to that NUMA node. For how to do that, see this SO answer:

How to ensure that std::thread are created in multi core?

For GNU/Linux with POSIX threads you will want pthread_setaffinity_np(); on FreeBSD, cpuset_setaffinity(); on Windows, SetThreadAffinityMask(); etc.

amdn
  • Thanks - definitely meant numa node, was a bit sloppy and unfamiliar with terminology. Very useful information. I'm afraid I only need 1) - so I should go with sched_setaffinity - is that correct? I worried a bit about passing information about the available CPU sets to the program - this seems a necessary requirement. I've looked for a way to automatically determine the CPU sets available to the program, if that makes sense (so that each thread knows what CPUset it needs to bind to). Is that possible, or is it not the right question? – micans Apr 04 '14 at 16:56
  • Another possibility is to restrict the threads in your process to a specific node before the process starts. That, in addition to the `numactl --localalloc` to request memory from the local node, will ensure that your threads in group A and B always run in the node where the memory is local. You can do that with the `--cpunodebind` option. Note that by doing this you are restricting your application to a single node, so to the extent that your application is the only one on that node, and to the extent that your application would not benefit from more threads, it is a win. – amdn Apr 04 '14 at 18:19
  • Some good reading here: https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Performance_Tuning_Guide/main-cpu.html – amdn Apr 04 '14 at 18:20
  • The intent is to scale across many NUMA nodes. My current understanding is that I could use `numactl --localalloc --cpunodebind=cpus`. To get across-group consistency, I can hopefully use `pthread_attr_getaffinity_np` and `pthread_attr_setaffinity_np`. For a start I can begin to at least report the binding of threads to sets. Downside is that these are nonportable (np) extensions. – micans Apr 07 '14 at 12:51