3

In a multi-threaded application running on a recent Linux Distributed Shared Memory system, is there a straightforward way to count the number of requests per thread to remote (non-local) NUMA memory nodes?

I am thinking of using PAPI to count interconnect traffic. Is this the way to go?

In my application, each thread is bound to a particular core or processor for its entire lifetime. When the application starts, memory is allocated page by page and spread round-robin across all available NUMA memory nodes.
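To make the placement described above concrete, here is a minimal sketch of strict round-robin page placement and the share of pages that end up remote for a pinned thread. The function names `page_node` and `remote_fraction` are hypothetical, and the model assumes every page is equally likely to be touched, which a real workload may not satisfy.

```python
def page_node(page_index: int, num_nodes: int) -> int:
    """Node holding a page under strict round-robin placement (assumed model)."""
    return page_index % num_nodes


def remote_fraction(thread_node: int, num_pages: int, num_nodes: int) -> float:
    """Share of pages that are remote for a thread pinned to thread_node."""
    remote = sum(
        1 for page in range(num_pages)
        if page_node(page, num_nodes) != thread_node
    )
    return remote / num_pages


if __name__ == "__main__":
    # A thread on node 0 of a 4-node machine, 1000 interleaved pages.
    print(remote_fraction(0, 1000, 4))
```

Under these assumptions, a thread on an N-node machine sees roughly (N-1)/N of the pages as remote, which gives a baseline to compare measured counts against.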

Thank you for your answers.

nandu
  • How much slowdown are you willing to accept? I wrote a Pin tool to trace every memory access and attribute each one to either the local node or a remote node. – Mark Jun 12 '12 at 01:05

3 Answers

4

If you have access to VTune, local and remote NUMA node accesses are counted by two hardware counters: OFFCORE_RESPONSE.ANY_DATA.OTHER_LOCAL_DRAM_0 for fast local NUMA node accesses and OFFCORE_RESPONSE.ANY_DATA.REMOTE_DRAM_0 for slower remote NUMA node accesses.

How the counters appear in VTune:

[Screenshot: Configuring NUMA hardware counters in VTune]

How the counters look in two scenarios:

NUMA unhappy code: core 0 (NUMA node 0) increments 50 MB residing on NUMA node 1.

[Screenshot: NUMA unhappy code with many remote NUMA node accesses]

NUMA happy code: core 0 (NUMA node 0) increments 50 MB residing on NUMA node 0.

[Screenshot: NUMA happy code with many local NUMA node accesses]

Neil Justice
1

I found the pcm-numa.x tool that comes with Intel PCM to be quite useful. It tells you the number of times each core has accessed the local or remote NUMA nodes.

Fidel
-1

I'm not sure this qualifies as straightforward, and I don't know what a "Distributed Shared Memory System" is, but on normal Linux, if you have access to the source, you may be able to count the requests yourself. You could use the answer to my question "Can I get the NUMA node from a pointer address?" to find out which node the requested memory is on and, knowing which node your thread is on, tally the remote requests. This only tells you how often you use remote memory, not how often that memory is missing from your local cache and has to be fetched, so it may not be exactly what you want.
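The tallying step could be sketched roughly as follows. This is only an illustration under stated assumptions: the helper names (`parse_cpulist`, `cpu_to_node_map`, `tally`) are hypothetical, the CPU-to-node mapping is read from the Linux sysfs files `/sys/devices/system/node/nodeN/cpulist`, and the per-page node ids are assumed to have been obtained elsewhere, for instance with `move_pages(2)` called with a NULL `nodes` argument, which reports the node each page currently resides on.

```python
import glob
import os
from collections import Counter


def parse_cpulist(text):
    """Expand a sysfs CPU list such as '0-3,8-11' into a set of CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue  # an empty cpulist (node without CPUs) yields no ids
        low, _, high = part.partition("-")
        cpus.update(range(int(low), int(high or low) + 1))
    return cpus


def cpu_to_node_map(root="/sys/devices/system/node"):
    """Map each CPU id to its NUMA node by reading nodeN/cpulist files."""
    mapping = {}
    for path in glob.glob(os.path.join(root, "node[0-9]*", "cpulist")):
        node_dir = os.path.basename(os.path.dirname(path))
        node = int(node_dir[len("node"):])
        with open(path) as handle:
            for cpu in parse_cpulist(handle.read()):
                mapping[cpu] = node
    return mapping


def tally(thread_cpu, accessed_page_nodes, cpu_node):
    """Count local vs. remote accesses for a thread pinned to thread_cpu.

    accessed_page_nodes holds one node id per access; how those ids are
    collected (e.g. via move_pages) is outside this sketch.
    """
    home = cpu_node[thread_cpu]
    counts = Counter()
    for node in accessed_page_nodes:
        counts["local" if node == home else "remote"] += 1
    return counts
```

For example, `tally(0, [0, 1, 1, 0, 1], {0: 0, 1: 1})` reports two local and three remote accesses for a thread pinned to CPU 0. As noted above, this counts every access to remote memory, including those served from cache.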

If you want to know about cache misses on remote memory, try adding the profiling tag to your question - it might attract more readers. If there's a profiler that will distinguish local memory misses from remote memory misses I'd be interested to find out too.

Community
Rob_before_edits