I am running this code on an Intel Xeon Gold to measure the latency of shared memory access between cores. I create 5 threads pinned to different cores, and they communicate through a shared memory region.

#define _GNU_SOURCE
#define _POSIX_C_SOURCE 200112L
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <linux/mman.h>
#include <string.h>
#include <unistd.h>
#include <pthread.h>
#include <inttypes.h>
#include <sched.h>

typedef uint64_t time_tst;

#define NUM_THREADS   5

struct thread_info {
    int core_id;
    int *addr;
};

time_tst time_tcv(void)
{
   unsigned int low, high;   /* rdtsc writes 32 bits each into EAX/EDX */
   __asm__ __volatile__("rdtsc" : "=a" (low), "=d" (high));
   return (((uint64_t)high << 32) | low);
}

void* create_shared_memory(size_t size)
{
   int fd = shm_open("carmv2shm", O_CREAT | O_RDWR,
                     S_IRUSR | S_IWUSR | S_IRGRP | S_IWGRP | S_IROTH | S_IWOTH);
   if (fd < 0) {             /* shm_open returns -1 on error, not 0 */
      perror("shm_open");
      return NULL;
   }
   if (ftruncate(fd, 0x1000 * size) < 0) {
      perror("ftruncate");
      return NULL;
   }
   return mmap(NULL, 0x1000 * size, PROT_READ | PROT_WRITE,
               MAP_LOCKED | MAP_SHARED_VALIDATE, fd, 0);
}

void* thread_func(void *args)
{
   struct thread_info *thread_info = args;
   pthread_t self = pthread_self();
   const unsigned int core_id = thread_info->core_id;
   cpu_set_t set;
   CPU_ZERO(&set);
   CPU_SET(core_id, &set);

   /* pthread_setaffinity_np returns an error number, not -1 */
   if (pthread_setaffinity_np(self, sizeof(set), &set) != 0) {
      printf("Error setting affinity\n");
   }

   time_tst t1 = 0;
   time_tst t2 = 0;

   char message[] = "hello message";

   t1 = time_tcv();
   memcpy(thread_info->addr, message, sizeof(message));
   t2 = time_tcv();
   /* the cast truncates pthread_t; good enough for a debug print */
   printf("thread id %u core id %u time diff %" PRIu64 "\n",
          (unsigned int)self, core_id, (t2 - t1));

   return 0;
}

int main()
{
   int i = 0;

   void *shmem = create_shared_memory(128);
   if (shmem == NULL || shmem == MAP_FAILED) {
      printf("failed to create shared memory\n");
      return 1;
   }
   struct thread_info *thread_info = calloc(NUM_THREADS, sizeof(struct thread_info));

   /* Every thread receives a pointer to this same element; the usleep()
      below only papers over the resulting data race (see the comments). */
   thread_info->addr = shmem;
   pthread_t tid[NUM_THREADS];

   while (i < NUM_THREADS)
   {
      thread_info->core_id = i + 1;
      pthread_create(&tid[i], NULL, thread_func, (void*)thread_info);
      usleep(1);
      i++;
   }

   i = 0;
   while (i < NUM_THREADS)
   {
      pthread_join(tid[i], NULL);
      i++;
   }

   return 0;
}

The output is:

thread id 2912491264 core id 1 time diff 6312
thread id 2904098560 core id 2 time diff 486
thread id 2895705856 core id 3 time diff 498
thread id 2753095424 core id 4 time diff 522
thread id 2818569984 core id 5 time diff 230

These time differences look quite high to me. Could anyone suggest how to reduce them?
Thanks

  • You are running a tiny piece of code once with no cache or vm concerns. I would not expect consistent results. – stark Aug 18 '21 at 11:03
  • @stark, could you please elaborate? Actually my concern here is not consistency but the size of the time difference. I would like it to be much lower, preferably two digits. – Swati Kunwar Aug 18 '21 at 11:07
  • For example, the first number is most likely higher because it is measuring the time to copy instructions from RAM to icache. – stark Aug 18 '21 at 11:18
  • Yes, that is right, but the other numbers are also quite high. – Swati Kunwar Aug 18 '21 at 11:56
  • The first number may be due to the actual mapping of the (single) page backing the allocated memory. The others may be due to the fact that all the cores (or are they hw threads? not sure what `CPU_SET` set) are writing to the same memory, thereby ping-ponging the cache line in M state between each other, particularly if `memcpy` is not optimized and takes many accesses to copy the string. Though, not sure if the execution actually overlaps. Also, I think the way `thread_info` is passed may open up an (albeit unlikely) data race (just saying). – Margaret Bloom Aug 18 '21 at 14:55
  • You probably need to repeat the operation several times (with only one shared memory allocation). Frequency scaling, code cache misses, the delay to create threads, and C-states are a few possible sources of time variation and of the big timings. You need to be *very careful* when you write such a benchmark. – Jérôme Richard Aug 18 '21 at 18:30
  • @MargaretBloom: Consecutive cache-miss stores to the same line in memcpy are likely to get coalesced into one LFB waiting to commit. Even if not, it's likely that once a core finally gains exclusive ownership of the line, a couple stores could commit before it acknowledges RFOs from other cores, so probably multiple stores could commit. (glibc memcpy will do two partially-overlapping 8-byte stores for a 14-byte memcpy with no intervening loads or other stores, into the same cache line since the shmem should be at least 16-byte aligned.) – Peter Cordes Aug 19 '21 at 01:34
  • @MargaretBloom: re: data race: yeah, every thread gets a pointer to the same `thread_info` struct, so if it doesn't read it before `main`'s loop changes it, it will get the wrong `thread_info->core_id = i;` value. The code in the question already shows evidence of that race being a real problem, with two `core id 4` lines. – Peter Cordes Aug 19 '21 at 01:39
  • @MargaretBloom Thanks for pointing out the race condition. I corrected the code, though the delay remains high. I even tried changing the implementation to use 64-bit accesses because of the 64-byte cache line on Intel Xeon, but no luck. – Swati Kunwar Aug 19 '21 at 10:57
  • Also, to rule out a shared-memory issue, I used local memory instead of the shared mapping to check the delay. The delay is still high, which confirms the issue is not shared memory access. It might be the cache or memcpy; the Xeon is doing something strange. – Swati Kunwar Aug 19 '21 at 12:11
  • You actually just papered over the race condition with `usleep`, enough that it doesn't happen in practice on a lightly loaded system; ok for benchmarking. In a new thread, maybe on a different core, it's likely that the CPU is still at idle frequency, much lower than the reference frequency which RDTSC ticks at. And that the caches and TLB will be cold, so you get a TLB miss and then an RFO on the memory that's shared between threads. You probably also get some code cache misses and other warm-up effects. ([Idiomatic way of performance evaluation?](https://stackoverflow.com/q/60291987)) – Peter Cordes Aug 19 '21 at 12:48
  • @PeterCordes As per the Intel guide, RDTSC looks synchronized among cores. How could I avoid cache/TLB misses? For my assignment, this latency needs to be reduced. I also tried fork() instead of threads to get a producer-consumer setup; with that, the latency is more deterministic but still high. – Swati Kunwar Aug 19 '21 at 20:16
  • Yeah, in most systems RDTSC is in sync across cores. But my point is, if the physical core clock is much slower than max, things that take a fixed number of core clocks will take more counts of ["reference cycles"](https://stackoverflow.com/a/51907627/224132). (Inter-core communication includes a lot of time that goes at uncore clock speed, and the uncore could be running faster.) – Peter Cordes Aug 19 '21 at 20:25
  • To reduce TLB misses, do some warm-up runs *before* the timed run, to prime this core's TLB. As always with benchmarking, warm-up is important. (As well as understanding the details of what you're trying to measure, for something as short as one write to shared memory. You're just measuring the local store, not even waiting for it to leave the store buffer and commit to L1d cache and be visible to other cores.) – Peter Cordes Aug 19 '21 at 20:26
  • @PeterCordes Thanks. Could you please give a code example for warm-up? Also, what should I add to the code to wait until the data is visible to another core? – Swati Kunwar Aug 19 '21 at 20:38
  • @PeterCordes Actually, the reason I am checking the time diff here is that a profiler tool on the Xeon shows the Linux kernel is quite busy during this inter-core read/write. The data is small, so that is not expected. To investigate, I started looking at the latency around the memcpy. – Swati Kunwar Aug 19 '21 at 20:42
  • For warmup, you could just do a `(volatile int*)` write to the cache line a few million times. A write instead of a read would make sure the microcode assist for the page-table's "dirty" bit isn't part of what you're measuring, which may partly explain the large time measurement even though you're just running memcpy without any `mfence`. (And without `lfence` to block OoO exec of `rdtsc` itself.) [A sketch along these lines follows this comment thread.] – Peter Cordes Aug 19 '21 at 20:43
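
Putting the comment suggestions together, here is a minimal sketch of what the warm-up, the fenced timing, and an acquire/release visibility flag could look like. It is an illustration under stated assumptions, not a definitive fix: the helper names (fenced_tsc, warm_up, timed_write, producer, consumer) are invented for this sketch, WARMUP_ITERS and RUNS are arbitrary, and it assumes GCC or Clang on x86-64 with x86intrin.h available (compile with e.g. gcc -O2 -pthread).

#include <stdint.h>
#include <string.h>
#include <stdatomic.h>
#include <x86intrin.h>          /* __rdtsc(), _mm_lfence(), _mm_mfence() */

#define WARMUP_ITERS 1000000    /* arbitrary; enough to ramp the core clock */
#define RUNS         1000

/* Serialize rdtsc so out-of-order execution cannot move it
   relative to the code being timed. */
static inline uint64_t fenced_tsc(void)
{
   _mm_lfence();
   uint64_t t = __rdtsc();
   _mm_lfence();
   return t;
}

/* Writes (not reads) so the page fault, TLB fill, dirty-bit assist,
   and RFO on the shared line all happen before the timed run. */
static void warm_up(volatile int *addr)
{
   for (long i = 0; i < WARMUP_ITERS; i++)
      *addr = (int)i;
}

/* One timed store; call it RUNS times and report the minimum or
   median instead of trusting a single sample. */
static uint64_t timed_write(int *addr)
{
   char message[] = "hello message";
   uint64_t t1 = fenced_tsc();
   memcpy(addr, message, sizeof(message));
   _mm_mfence();               /* include store-buffer drain in the timing */
   uint64_t t2 = fenced_tsc();
   return t2 - t1;
}

/* Visibility to another core: the release store publishes the data,
   and the acquire loads on the other core wait until it is visible. */
static atomic_int ready;

static void producer(int *addr)
{
   char message[] = "hello message";
   memcpy(addr, message, sizeof(message));
   atomic_store_explicit(&ready, 1, memory_order_release);
}

static int consumer(const int *addr)
{
   while (atomic_load_explicit(&ready, memory_order_acquire) == 0)
      ;                        /* spin until the flag is set */
   return *addr;               /* the memcpy'd data is visible by now */
}

Each thread would call warm_up() on its line before timing, then run timed_write() RUNS times and keep the minimum or median. The acquire/release pair is what lets the consumer observe when the producer's data actually became visible; a lone rdtsc pair around memcpy only times the local store, as noted in the comments above.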
