In a multi-threaded application running on a recent Linux distributed shared memory system, is there a straightforward way to count the number of requests per thread to remote (non-local) NUMA memory nodes?
I am thinking of using PAPI to count interconnect traffic. Is this the way to go?
In my application, each thread is bound to a particular core or processor for its entire lifetime. When the application begins, memory is allocated page-wise and spread round-robin across all available NUMA memory nodes.
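To make the layout concrete, here is a minimal sketch in Python of what I mean by "remote" under this placement. It is only a model, not a measurement: it assumes a 4096-byte page size, that the first page lands on node 0, and that placement is strictly round-robin. The names `page_node` and `count_remote` are made up for illustration. What I actually want is the hardware-level equivalent of this count.

```python
PAGE_SIZE = 4096  # assumed page size in bytes


def page_node(offset, num_nodes, page_size=PAGE_SIZE):
    """Node holding the page at this byte offset, assuming strict
    round-robin placement that starts at node 0."""
    return (offset // page_size) % num_nodes


def count_remote(offsets, thread_node, num_nodes, page_size=PAGE_SIZE):
    """Count accesses whose page lives on a node other than the one
    the thread is bound to."""
    return sum(
        1
        for off in offsets
        if page_node(off, num_nodes, page_size) != thread_node
    )


if __name__ == "__main__":
    # One access per page over 8 pages, thread pinned to node 1 of 4.
    offsets = [p * PAGE_SIZE for p in range(8)]
    print(count_remote(offsets, thread_node=1, num_nodes=4))
```

With N nodes and uniform access, this model puts roughly (N - 1)/N of each thread's accesses on remote nodes, which is why I would like a per-thread count from real counters.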
Thank you for your answers.