
I read the statement below somewhere and I cannot really follow it:

There is a slight gain in performance for more than 16 and more than 32 cores. The seeds are integer values, i.e., they require 4 bytes of memory. A cache line in our system has 64 bytes. Therefore 16 seeds fit into a single cache line. When going to 17/33 threads, the additional seed is placed in its own cache line so that the threads are not further obstructed.

The code referred to in this question is provided below:

    #include <stdlib.h>
    #include <stdio.h>
    #include <omp.h>

    int main(int argc, char *argv[]) {
        long long int tosses = atoll(argv[1]);
        long long int hits = 0;
        int threads = atoi(argv[2]);
        double start, end;
        int i;
        unsigned int seeds[threads];

        for (i = 0; i < threads; i++)
            seeds[i] = i + 1;

        start = omp_get_wtime();
        #pragma omp parallel reduction(+:hits) num_threads(threads)
        {
            int myrank = omp_get_thread_num();
            long long int local_hits = 0, toss;
            double x, y;
            #pragma omp for
            for (toss = 0; toss < tosses; toss++) {
                x = rand_r(&seeds[myrank]) / (double)RAND_MAX * 2 - 1;
                y = rand_r(&seeds[myrank]) / (double)RAND_MAX * 2 - 1;
                if (x * x + y * y < 1)
                    local_hits++;
            }
            hits += local_hits;
        }
        end = omp_get_wtime();

        printf("Pi: %f\n", 4.0 * hits / tosses);
        printf("Duration: %f\n", end - start);
        return 0;
    }

The question actually asked was: why does this code scale so badly over multiple cores?

My questions are as follows:

  1. What is conveyed by the above statement? The cache line for the 17th/33rd core can also be invalidated, correct? So how is it different from cores 1 to 16?
  2. Is the threads' own independent memory (stack memory/private memory) part of the cache memory or of main memory?
  3. How can I relate a cache line to a block in terms of cache memories?
Jeet
  • The context is clearly missing to fully understand the sentence. What seed are you talking about? What are the threads supposed to do? There is no attached code. What is a "block" supposed to mean in this context? The definition is not provided. – Jérôme Richard Jan 23 '23 at 10:02
  • @JérômeRichard- Hi! Thanks for your response...Please do let me know if you need any further inputs or context! – Jeet Jan 23 '23 at 13:26
  • 1
    Well, readable code might be useful:( – Martin James Jan 23 '23 at 16:01
  • @MartinJames - What do you mean by the readable code? Do you mean proper indentation of the code? – Jeet Jan 23 '23 at 20:04

1 Answer


The seeds are integer values, i.e., they require 4 bytes of memory

This is not always true, though it is the case on most platforms. The C/C++ languages do not prevent sizeof(int) from being 8, or even 2, for example.

Therefore 16 seeds fit into a single cache line.

While this is true, there is no alignment requirement on the seeds array beyond being aligned to sizeof(unsigned int). This means the array can theoretically span two cache lines even with only 16 seeds. The alignment depends on the target platform and, more specifically, on the target compiler. In C11, you can use alignas to specify an alignment constraint and ensure the array is aligned to a 64-byte cache line.

This is an important point, since adding a new seed will not change the result much if the array is not aligned in memory. Moreover, threads can run twice as fast if 8 seeds are on one cache line and the 8 others are on another (assuming there are no NUMA effects making things even more complex).

Let's assume the array is aligned to 64 bytes.

What is conveyed by the above statement? The cache line for 17th/33rd core can be also invalidated correct? So how is it different from the cores 1 to 16?

With 17 seeds and 17 threads, the first 16 threads will continue to compete for the same cache line. This effect is called cache-line bouncing and it makes those threads run much more slowly. Meanwhile, the additional thread can operate on its own cache line, resulting in a much faster execution (but only for this specific thread). AFAIK, there is no reason to believe that the other 16 threads will be any less obstructed, assuming the array alignment is the same.

The own independent memory of the threads (stack memory/private memory) is a part of the cache memory or the main memory?

On modern computers, memory is nearly always cached in usual use-cases. One can bypass the cache using specific instructions (e.g. non-temporal instructions, though they are only a hint) or a special kind of allocated memory (see write-combining memory), but standard accesses to memory allocated in the standard way (e.g. with malloc) are always cached on mainstream architectures. This is also true for the stack and for threads' private memory.

How can I relate cache line and block in terms of cache memories?

I am not sure I understand your question. If you want to know which cache line a given byte in RAM is associated with, then... well... to put it shortly, you cannot. Modern processors use multi-level, N-way set-associative caches. From an address, you can determine the index of the cache set, but not easily which way of that set the line will occupy. The latter depends on the cache replacement policy. The thing is, this algorithm is not officially documented for mainstream processors. For example, Intel processors appear to use an adaptive replacement policy and no longer a pseudo-LRU strategy. One should also consider the mapping of virtual to physical addresses. On top of that, Intel processors use an undocumented hash-based strategy to uniformly distribute accesses (based on physical addresses) across the L3 cache. Thus, if you really want to do that, you first need to pick a specific processor micro-architecture and a specific cache level for the problem to be manageable. You will certainly need a lot of time to understand and simulate what the target processor actually does. Alternatively, you can assume an unrealistic LRU replacement policy to avoid the complexity (madness?) of real-world mainstream processors.



Jérôme Richard