
While trying to speed up my applications on standard (non-NUMA) multi-core PCs, I always found that the bottleneck was the call to malloc(), because even on multi-core machines it is shared and synchronized across all the cores.

I have available a PC with NUMA architecture using Linux and C and I have two questions:

  1. On a NUMA machine, since each core is provided with its own memory, will malloc() execute independently on each core/memory without blocking the other cores?
  2. On these architectures, how are calls to memcpy() made? Can it be called independently on each core, or does calling it on one core block the others? I may be wrong, but I remember that memcpy() has the same problem as malloc(), i.e. when one core is using it the others have to wait.
unwind
Abruzzo Forte e Gentile

2 Answers


A NUMA machine is a shared-memory system, so memory accesses from any processor can reach the memory without blocking. If the memory model were message-based, then accessing remote memory would require the requesting processor to ask the remote memory's local processor to perform the desired operation. In a NUMA system, however, a remote processor can still affect the performance of the local processor by consuming bandwidth on the memory links, though this depends on the specific architectural configuration.

As for 1, this depends entirely on the OS and the malloc library. The OS is responsible for presenting the per-core / per-processor memory either as a unified space or as NUMA. malloc may or may not be NUMA-aware. But fundamentally, a given malloc implementation may or may not be able to execute concurrently with other requests. The answer from Al (and the associated discussion) addresses this point in greater detail.
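If you want node-local memory regardless of what your malloc does, one option mentioned in the comments below is libnuma. A minimal sketch, assuming libnuma is installed and the kernel is NUMA-capable (link with `-lnuma`); the 1 MiB size is an arbitrary example:

```c
/* Sketch only: explicit node-local allocation via libnuma.
 * Compile with: gcc demo.c -lnuma */
#include <numa.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    if (numa_available() == -1) {
        fprintf(stderr, "NUMA is not supported on this system\n");
        return 1;
    }
    size_t sz = 1 << 20;               /* 1 MiB, arbitrary example size */
    void *buf = numa_alloc_local(sz);  /* memory on the calling thread's node */
    if (buf == NULL)
        return 1;
    memset(buf, 0, sz);                /* touching the pages commits them locally */
    numa_free(buf, sz);
    return 0;
}
```

Note that numa_alloc_local() is much slower than malloc(), so it is meant for large, long-lived buffers rather than as a drop-in malloc replacement.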

As for 2, since memcpy consists of a series of loads and stores, the only impact would again be the potential architectural effects of using another processor's memory controller, links, etc.

Brian
  • Hi Brian. Thanks a lot. Are you aware of any good malloc library that is NUMA aware? I googled and I found MPC...is it good in your opinion? – Abruzzo Forte e Gentile Mar 29 '11 at 17:04
  • In the rare times that I am writing something to be NUMA-aware, I directly allocate my memory from the OS using VirtualAllocExNuma (Windows) or libnuma (linux). – Brian Mar 30 '11 at 03:11
  1. Calls to malloc in separate processes will execute independently regardless of whether you are on a NUMA architecture. Calls to malloc in different threads of the same process cannot execute independently, because the memory returned is equally accessible to all threads within the process. If you want memory that is local to a particular thread, read up on Thread Local Storage. I have not been able to find any clear documentation on whether the Linux VM and scheduler are able to optimize the affinity between cores, threads, local memory and thread-local storage.
Al Riddoch
  • "Calls to malloc in different threads of the same process cannot execute independently" - on non-NUMA they can with per-thread memory pools, although then calls to `free` in different threads might not be independent, since of course you can free memory in a different thread from where you allocated it. – Steve Jessop Mar 29 '11 at 11:39
  • But that's entirely up to the malloc implementation. People typically use 3rd-party malloc libraries (e.g. tcmalloc) to improve performance in multi-threaded applications (though neither tcmalloc nor the glibc malloc takes NUMA into account) – nos Mar 29 '11 at 14:10
  • Hi Steve. As far as I know, memory pools are just contiguous preallocated chunks of memory that are never freed (..at least that's how we used them on non-NUMA architectures). It seems to me that what you are proposing is more a solution based on some library doing 2 things: A = creating a memory pool per thread, B = redefining the malloc behavior. Am I correct, or is it something really specified at the OS level for that kind of hardware? – Abruzzo Forte e Gentile Mar 29 '11 at 15:11
  • @Abruzzo: I think "pool" is used in different contexts for different things. A pool doesn't necessarily mean something where there's no means to free an individual allocation, and yes I do mean that `malloc` implementations can do this. I added the comment because I wasn't sure whether Al was talking about NUMA or non-NUMA architectures (or both) in that sentence, and I just wanted to chip in that it's up to the implementation where `malloc` memory actually comes from, and normally Linux lets you pick an allocator that does avoid most contention on `malloc`. – Steve Jessop Mar 29 '11 at 21:06
  • P.S. I also found libnuma and TCMalloc, but they don't seem good enough (at least TCMalloc seems suited to non-NUMA only). – Abruzzo Forte e Gentile Mar 29 '11 at 22:01