
I am trying to show that avoiding false sharing results in a performance benefit when using two vectors of integers, a reader vector (values to be read from) and a writer vector (where I am storing the values).

Assume the readWrite function below is defined.

void readWrite(std::vector<int> &reader, std::vector<int> &writer, int start, int stop, int step)
{
  /*
    Copy every `step`-th value from `reader` to `writer`,
    starting from index `start` up to index `stop` (exclusive).

    Contract:
      readWrite: std::vector<int>&, std::vector<int>&, int, int, int => void

    Requirements:
      reader.size() == writer.size() == BUFFER_SIZE
      start < reader.size() and start < writer.size()

    Effects:
      Mutates `writer`
  */
  for (int i = start; i < (int)reader.size() && i < stop; i += step)
  {
    for (int j = 0; j < REPS; ++j)
    {
      writer.at(i) = reader.at(i);
      writer.at(i) -= reader.at(i);
      writer.at(i) = reader.at(i);
    }
  }
}

And is used like so:

std::chrono::microseconds mtBench(std::vector<int> &reader, std::vector<int> &writer, std::vector<std::thread> &threads)
{
  note->info("Threads scheduling...");
  /*`slice` represents how many elements each thread is responsible for*/
  int slice = (int)(BUFFER_SIZE / THREAD_COUNT);
  auto start = std::chrono::high_resolution_clock::now();
  /**
   * MT_NOSHARE
   * If BUFFER_SIZE = 1024 and THREAD_COUNT = 2, then slice = 512
   * tid 0, 1 => start = 0 * 512, 1 * 512 respectively;
   * and stop = (0 * 512) + 512, (1 * 512) + 512 respectively
   * step = 1 because each thread is reading one element at a time
   *
   * MT_SHARE
   * each thread starts at index tid and strides by THREAD_COUNT,
   * so the threads work side by side on adjacent elements
   */
  for (int i = 0; i < THREAD_COUNT; ++i)
  {
#ifdef MT_NOSHARE
    threads.at(i) = std::thread(readWrite, std::ref(reader), std::ref(writer), i * slice, (i * slice) + slice, 1);
#endif

#ifdef MT_SHARE
    threads.at(i) = std::thread(readWrite, std::ref(reader), std::ref(writer), i, BUFFER_SIZE, THREAD_COUNT);
#endif
  }

  for (int i = 0; i < THREAD_COUNT; ++i)
  {
    threads.at(i).join();
  }
  auto end = std::chrono::high_resolution_clock::now();
  note->info("Threads finished...");
  std::chrono::microseconds duration = std::chrono::duration_cast<std::chrono::microseconds>(end - start);
  return duration;
}

note is just a single-threaded coloured logger from spdlog. You may assume that BUFFER_SIZE is a power of two (I use pow(2, 24)), THREAD_COUNT is between 2 and 8 (my machine can run 8 threads in parallel, as reported by std::thread::hardware_concurrency()), and REPS is a power of two as well (I kept it at pow(2, 3)).
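For reference, those constants can be sketched like this (a hedged sketch; only the values are stated above, so the exact declarations are assumptions):

```cpp
#include <thread>

// Hypothetical definitions matching the values described above.
constexpr int BUFFER_SIZE = 1 << 24;  // pow(2, 24) = 16777216 ints
constexpr int REPS = 1 << 3;          // pow(2, 3) = 8 repetitions per index
const int THREAD_COUNT = 2;           // between 2 and 8 on this machine
```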

I initialize threads with size THREAD_COUNT, reader with the integers 0 to BUFFER_SIZE - 1, and writer to size BUFFER_SIZE.
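That initialization can be sketched as follows (a minimal sketch; the variable names and the use of std::iota are assumptions):

```cpp
#include <numeric>
#include <thread>
#include <vector>

constexpr int BUFFER_SIZE = 1 << 24;  // assumed, as described above
constexpr int THREAD_COUNT = 2;

std::vector<std::thread> threads(THREAD_COUNT);  // THREAD_COUNT default-constructed threads
std::vector<int> reader(BUFFER_SIZE);            // filled with 0 .. BUFFER_SIZE - 1 below
std::vector<int> writer(BUFFER_SIZE, 0);         // zero-initialized, same size

// Fill reader with consecutive integers starting at 0.
void initReader() { std::iota(reader.begin(), reader.end(), 0); }
```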

I make 20 calls to mtBench with the necessary parameters, record each duration, and average them over the 20 calls. However, my logs show that there is not much difference between the averages when I am false sharing versus when I am not false sharing.
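The driver loop looks roughly like this (a hedged sketch: mtBenchStub is a hypothetical stand-in for the real mtBench above, so the averaging logic is runnable on its own):

```cpp
#include <chrono>

// Hypothetical stand-in for mtBench, returning a fixed duration.
std::chrono::microseconds mtBenchStub() { return std::chrono::microseconds(3400000); }

// Run the benchmark `runs` times and return the mean duration in microseconds.
double meanOverRuns(int runs)
{
  long long total = 0;
  for (int run = 0; run < runs; ++run)
  {
    total += mtBenchStub().count();  // one benchmark call per iteration
    // writer would be cleared here between iterations ("After Clearing writer...")
  }
  return (double)total / runs;
}
```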

These are snippets of the logs:

When I am indeed false sharing

[info] Max # of threads/cores run parallel  - (8)
[info] Multi-threaded Benchmark False Sharing...
[info] Buffer size - (16777216), # of threads - (2)

[info] Iteration - (1)
[info] Threads scheduling...
[info] Threads finished...
[info] Running sanity checks...
[info] Sanity checks passed...
[info] Time taken (microseconds) - (3524508)
[info] After Clearing writer...

[info] Iteration - (2)
[info] Threads scheduling...
[info] Threads finished...
[info] Running sanity checks...
[info] Sanity checks passed...
[info] Time taken (microseconds) - (3430828)
[info] After Clearing writer...

..........

[info] Iteration - (20)
[info] Threads scheduling...
[info] Threads finished...
[info] Running sanity checks...
[info] Sanity checks passed...
[info] Time taken (microseconds) - (3452525)
[info] After Clearing writer...

[info] Mean time (microseconds) after (20) runs - (3438808.85)

When I am not false sharing

[info] Max # of threads/cores run parallel  - (8)
[info] Multi-threaded Benchmark No False Sharing...
[info] Buffer size - (16777216), # of threads - (2), slice size - (8388608)

[info] Iteration - (1)
[info] Threads scheduling...
[info] Threads finished...
[info] Running sanity checks...
[info] Sanity checks passed...
[info] Time taken (microseconds) - (3435623)
[info] After Clearing writer...

[info] Iteration - (2)
[info] Threads scheduling...
[info] Threads finished...
[info] Running sanity checks...
[info] Sanity checks passed...
[info] Time taken (microseconds) - (3420221)
[info] After Clearing writer...

.........

[info] Iteration - (20)
[info] Threads scheduling...
[info] Threads finished...
[info] Running sanity checks...
[info] Sanity checks passed...
[info] Time taken (microseconds) - (3428355)
[info] After Clearing writer...

[info] Mean time (microseconds) after (20) runs - (3428552.75)

So the mean is 3438808.85 microseconds when false sharing versus 3428552.75 microseconds when not false sharing, a difference of well under 1%. I expected a significant difference between the averages.

Also these are my machine specifications:

System Version: macOS 12.3 (21E230)
Kernel Version: Darwin 21.4.0

Model Name: MacBook Air
Model Identifier: MacBookAir10,1
Chip: Apple M1
Total Number of Cores: 8 (4 performance and 4 efficiency)
Memory: 16 GB

Also, when I run sudo sysctl -a | grep cacheline in my terminal I get hw.cachelinesize: 128, so I am pretty sure I would be false sharing quite a bit over the first 32 integers (128 bytes per line / 4 bytes per int).
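That estimate can be sanity-checked with a couple of constants (sizes assumed: a 128-byte line, as sysctl reports, and a 4-byte int on this platform):

```cpp
#include <cstddef>

// With a 128-byte cache line and 4-byte ints, one line holds 32 ints, so in
// the MT_SHARE layout the threads' first writes all land on the same line.
constexpr std::size_t CACHE_LINE_BYTES = 128;  // hw.cachelinesize on this M1
constexpr std::size_t INTS_PER_LINE = CACHE_LINE_BYTES / sizeof(int);  // typically 32
```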

I was advised to use perf to view the cache-miss rate, but it is not available on macOS. I tried to run my program on an Amazon EC2 Ubuntu 20.04 instance using cachegrind from valgrind, but I could only get one core on the free tier, so my multithreaded program ran sequentially and the results were inconsequential. Initially, my readWrite function did not have the inner for loop over j; it was just writer.at(i) = reader.at(i);. I thought that reading and writing the same index multiple times per thread would let me see a significant performance difference, but unfortunately I cannot see a significant change. I have also tried my benchmark with THREAD_COUNT as 4, but there still was not much of a difference between false sharing and not false sharing.

How can I modify my code to show a significant performance difference between false sharing on a cache line and not false sharing? Am I benchmarking wrong? Any help would be greatly appreciated. I tried to take my time on SO to give as much detail as possible when crafting this question, and I hope I am communicating my problem and intention clearly.

chitanda
  • I think the problem is that you get false sharing in both versions. Try to get each thread to deal with multiples of `std::hardware_constructive_interference_size` bytes. – Ted Lyngmo May 08 '22 at 20:29
  • Even if you don’t have perf on a mac, there must be an online version and your program isn’t super complicated. Worst case scenario, you can use a cloud Linux box. But the beginning middle and end of your story is you need to compare those cpu metrics to see what’s really going on. These tools also exist for windows machines. – Taekahn May 08 '22 at 20:37
  • @Taekahn Perf requires the access to hardware counters. The useful counters are not accessible by default (for security reasons) and one need to be root to modify the `kernel.perf_event_paranoid` sysctl. Hypervisors/VM often disable that for obvious security reasons (as it would enable one user to track what other users do and AFAIK even find their password). There is no free fully equivalent to Perf on Windows (and WSL does not really support it yet) except VTune (which often causes a BSOD on my machine and I guess hardware counters only works for Intel processor). – Jérôme Richard May 08 '22 at 21:34
  • @TedLyngmo I tried to use `std::hardware_constructive_interference_size` but apparently I got an error saying something along the lines of "there is no member hardware_constructive_interference_size in namespace std". A similar thing for `std::std::hardware_constructive_interference_size`. I made sure to `#include ` `#include ` and even `#include `. I tried for both Apple Clang and GCC. Also, an integer is 4 bytes and I have pow(2, 24) integers in my vector; with only 2 threads, each thread would be responsible for pow(2, 25) bytes, which is a multiple of 128/64. – chitanda May 09 '22 at 19:23
  • My comment could not fit above so 128 is `hw.cachelinesize` for ARM, and 64 for Intel (I have another terminal that runs on Rosetta so when I call `arch` I get `i386`). Does it make a difference if I allocate all the vectors on the heap instead of the stack. I read [here](https://stackoverflow.com/questions/65402150/what-data-will-be-cached) that it won't make a difference so I am still at a loss for a different approach to show the difference. – chitanda May 09 '22 at 19:25
  • @chitanda Yeah, `g++` and `clang++` has refused to implement those. I usually check if `__cpp_lib_hardware_interference_size` is defined before using them. If that feature test macro is not defined, I usually use a value hardcoded for the target, so use the one that suites your target best. – Ted Lyngmo May 09 '22 at 19:30
  • @TedLyngmo I realized that I need to (1) Use arrays instead of vectors and (2) Allocate the array outside the function definition (static I believe). I also used an array of structs where each struct just holds an integer and I noticed that not false sharing was 5x faster than when false sharing. Also, allocating arrays on the heap within a function definition (calling `new` inside a function) also did not show any difference. Thanks for your help. – chitanda May 10 '22 at 19:24
  • @chitanda Wow, nice! I've usually used `vector`s for misc. tasks and made sure that every work package is separated by using `hardware_destructive_interference_size`, but it's not always obvious what makes it really fast (I haven't bothered learning any modern assembler). Great that it seemed to work out! – Ted Lyngmo May 10 '22 at 19:33
  • The "vector vs. array" difference is bogus. Try _not_ using `.at()` to access vector elements. As-is, your use of `.at(i)` _probably_ dominates your execution time, and _hides_ the difference you are trying to measure. – Employed Russian May 15 '22 at 15:00
  • @EmployedRussian I think so too. But I just took the advice I got from someone else in my lab who claimed "vectors were optimized to avoid false sharing" so I switched to arrays. Now that I switched to arrays, I saw a 5x speedup for a few days. Today, there is no difference and I don't know why. The goal of this experiment is to see if students can figure out how to avoid false sharing given a simple multi-threaded problem. But if I (as the instructor) can't benchmark it, I'll be hard-pressed to drive this point home in recitation. – chitanda May 15 '22 at 20:42
  • For a concept so common in systems programming (so many articles on it), it is proving quite difficult to benchmark (some days a 5x speedup, other days no speedup). It leads me to question the whole "avoiding false sharing" recommendation. – chitanda May 15 '22 at 20:45
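Ted Lyngmo's feature-test suggestion in the comments above can be sketched like this (a hedged example; the 128-byte fallback is an assumption for the Apple M1 target, and the *destructive* variant is the one relevant to false sharing):

```cpp
#include <cstddef>
#include <new>

// Prefer the standard constant when the library ships it; otherwise fall back
// to a value hardcoded for the target (128 bytes on Apple M1, 64 on x86-64).
#ifdef __cpp_lib_hardware_interference_size
constexpr std::size_t kDestructiveLine = std::hardware_destructive_interference_size;
#else
constexpr std::size_t kDestructiveLine = 128;  // assumed target: Apple M1
#endif
```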
