I am trying to show that avoiding false sharing yields a performance benefit when using two vectors of integers: a reader vector (values to be read from) and a writer vector (where I store the values). Assume the `readWrite` function below is defined:
```cpp
void readWrite(std::vector<int> &reader, std::vector<int> &writer, int start, int stop, int step)
{
    /*
    Copy every `step`-th value from `reader` to `writer`,
    starting from index `start` up to index `stop` (exclusive).
    Contract:
        readWrite: std::vector<int>&, std::vector<int>&, int, int, int => void
    Requirements:
        reader.size() == writer.size() == BUFFER_SIZE
        start < reader.size(), writer.size()
    Effects:
        Mutates `writer`
    */
    for (int i = start; i < (int)reader.size() && i < stop; i += step)
    {
        for (int j = 0; j < REPS; ++j)
        {
            writer.at(i) = reader.at(i);
            writer.at(i) -= reader.at(i);
            writer.at(i) = reader.at(i);
        }
    }
}
```
It is used like so:
```cpp
std::chrono::microseconds mtBench(std::vector<int> &reader, std::vector<int> &writer, std::vector<std::thread> &threads)
{
    note->info("Threads scheduling...");
    /* `slice` is how many contiguous elements each thread is responsible for */
    int slice = (int)(BUFFER_SIZE / THREAD_COUNT);
    auto start = std::chrono::high_resolution_clock::now();
    /**
     * MT_NOSHARE
     * If BUFFER_SIZE = 1024 and THREAD_COUNT = 2, then slice = 512.
     * tid 0, 1 => start = 0 * 512, 1 * 512 respectively;
     * and stop = (0 * 512) + 512, (1 * 512) + 512 respectively.
     * step = 1 because each thread walks its own contiguous slice
     * one element at a time.
     *
     * MT_SHARE
     * Threads interleave element by element (step = THREAD_COUNT),
     * so their writes land side by side within the same cache lines.
     */
    for (int i = 0; i < THREAD_COUNT; ++i)
    {
#ifdef MT_NOSHARE
        threads.at(i) = std::thread(readWrite, std::ref(reader), std::ref(writer), i * slice, (i * slice) + slice, 1);
#endif
#ifdef MT_SHARE
        threads.at(i) = std::thread(readWrite, std::ref(reader), std::ref(writer), i, BUFFER_SIZE, THREAD_COUNT);
#endif
    }
    for (int i = 0; i < THREAD_COUNT; ++i)
    {
        threads.at(i).join();
    }
    auto end = std::chrono::high_resolution_clock::now();
    note->info("Threads finished...");
    return std::chrono::duration_cast<std::chrono::microseconds>(end - start);
}
```
`note` is just something I use for logging (a single-threaded coloured logger from spdlog). You may assume that `BUFFER_SIZE` is a power of two (I use pow(2, 24)), `THREAD_COUNT` is between 2 and 8 (my machine can run 8 threads in parallel, as reported by `std::thread::hardware_concurrency()`), and `REPS` is a power of two as well (I kept it at pow(2, 3)). I initialize `threads` with size `THREAD_COUNT`, `reader` with the integers from 0 to `BUFFER_SIZE - 1`, and `writer` to a size of `BUFFER_SIZE`.
I make 20 calls to `mtBench` with the necessary parameters, record each duration, and take the average over the 20 calls. However, my logs show that there is not much difference between the averages when I am false sharing versus when I am not.
These are snippets of the logs:
When I am indeed false sharing:

```
[info] Max # of threads/cores run parallel - (8)
[info] Multi-threaded Benchmark False Sharing...
[info] Buffer size - (16777216), # of threads - (2)
[info] Iteration - (1)
[info] Threads scheduling...
[info] Threads finished...
[info] Running sanity checks...
[info] Sanity checks passed...
[info] Time taken (microseceonds) - (3524508)
[info] After Clearing writer...
[info] Iteration - (2)
[info] Threads scheduling...
[info] Threads finished...
[info] Running sanity checks...
[info] Sanity checks passed...
[info] Time taken (microseceonds) - (3430828)
[info] After Clearing writer...
..........
[info] Iteration - (20)
[info] Threads scheduling...
[info] Threads finished...
[info] Running sanity checks...
[info] Sanity checks passed...
[info] Time taken (microseceonds) - (3452525)
[info] After Clearing writer...
[info] Mean time (microseceonds) after (20) runs - (3438808.85)
```
When I am not false sharing:

```
[info] Max # of threads/cores run parallel - (8)
[info] Multi-threaded Benchmark No False Sharing...
[info] Buffer size - (16777216), # of threads - (2), slice size - (8388608)
[info] Iteration - (1)
[info] Threads scheduling...
[info] Threads finished...
[info] Running sanity checks...
[info] Sanity checks passed...
[info] Time taken (microseceonds) - (3435623)
[info] After Clearing writer...
[info] Iteration - (2)
[info] Threads scheduling...
[info] Threads finished...
[info] Running sanity checks...
[info] Sanity checks passed...
[info] Time taken (microseceonds) - (3420221)
[info] After Clearing writer...
.........
[info] Iteration - (20)
[info] Threads scheduling...
[info] Threads finished...
[info] Running sanity checks...
[info] Sanity checks passed...
[info] Time taken (microseceonds) - (3428355)
[info] After Clearing writer...
[info] Mean time (microseceonds) after (20) runs - (3428552.75)
```
So: 3438808.85 microseconds with false sharing versus 3428552.75 microseconds without. I expected a significant difference between the averages.
Also, these are my machine specifications:

```
System Version: macOS 12.3 (21E230)
Kernel Version: Darwin 21.4.0
Model Name: MacBook Air
Model Identifier: MacBookAir10,1
Chip: Apple M1
Total Number of Cores: 8 (4 performance and 4 efficiency)
Memory: 16 GB
```
Also, when I run `sudo sysctl -a | grep cacheline` in my terminal I get `hw.cachelinesize: 128`, so I am pretty sure I would be false sharing quite a bit: the first 32 integers all fall on the same cache line.
I was advised to use perf to view the cache miss rate, but apparently it is not available on macOS. I tried to run my program on an Amazon EC2 Ubuntu 20.04 instance using Cachegrind from Valgrind, but I could only get one core on the free tier, so my multithreaded program ran effectively sequentially and the results are inconclusive. Initially, my `readWrite` function did not have the inner for loop over `j`; it was just `writer.at(i) = reader.at(i);`, but I thought that reading and writing the same index multiple times per thread would let me see a significant performance difference. Unfortunately, I still cannot see a significant change. I have also tried my benchmark with `THREAD_COUNT` set to 4, but there still was not much of a difference between false sharing and not false sharing.
How can I modify my code to show a significant performance difference between false sharing on a cache line and not false sharing? Am I benchmarking incorrectly? Any help would be greatly appreciated. I have tried to take my time and give as much detail as possible when crafting this question; I hope I am communicating my problem and intention clearly.