I read the following statement somewhere and cannot really follow it:
There is a slight gain in performance for more than 16 and more than 32 cores. The seeds are integer values, i.e., they require 4 bytes of memory. A cache line in our system has 64 bytes. Therefore 16 seeds fit into a single cache line. When going to 17/33 threads, the additional seed is placed in its own cache line so that the threads are not further obstructed.
The code that the statement refers to is given below:

    #include <stdlib.h>
    #include <stdio.h>
    #include <omp.h>

    int main(int argc, char *argv[]) {
        long long int tosses = atoll(argv[1]);
        long long int hits = 0;
        int threads = atoi(argv[2]);
        double start, end;
        int i;
        unsigned int seeds[threads];

        for (i = 0; i < threads; i++)
            seeds[i] = i + 1;

        start = omp_get_wtime();
        #pragma omp parallel reduction(+:hits) num_threads(threads)
        {
            int myrank = omp_get_thread_num();
            long long int local_hits = 0, toss;
            double x, y;
            #pragma omp for
            for (toss = 0; toss < tosses; toss++) {
                x = rand_r(&seeds[myrank])/(double)RAND_MAX * 2 - 1;
                y = rand_r(&seeds[myrank])/(double)RAND_MAX * 2 - 1;
                if (x*x + y*y < 1)
                    local_hits++;
            }
            hits += local_hits;
        }
        end = omp_get_wtime();

        printf("Pi: %f\n", 4.0 * hits / tosses);
        printf("Duration: %f\n", end-start);
        return 0;
    }

The question that was actually asked was: why does this code scale so badly over multiple cores?
My questions are as follows:
- What does the above statement mean? The cache line for the 17th/33rd core can also be invalidated, correct? So how is it different from cores 1 to 16?
- Is a thread's own independent memory (stack memory/private memory) part of the cache or of main memory?
- How do a cache line and a block relate to each other in terms of cache memories?