
I have the following code snippet, for which I have to calculate the arithmetic intensity (AI).

```c
const int N = 8192;
float a[N], b[N], c[N], d[N];
...
#pragma omp parallel for simd
for (int i = 0; i < N; i++)
{
    const float tmp_a = a[i];
    const float tmp_b = b[i];
    c[i] += tmp_a * tmp_b;
    d[i] = tmp_a + tmp_b;
}
```

- Case 1: What will be the AI if `tmp_a` and `tmp_b` are in registers?
- Case 2: What will be the AI if `tmp_a` and `tmp_b` are in RAM or cache?

I know AI is defined as the number of floating-point operations divided by the number of bytes transferred. How does the number of bytes transferred depend on whether the data is stored in RAM, cache, or registers? What additional information do we need to calculate the maximum floating-point throughput achievable by this code?

Jeet
    Case 2 sounds like some nonsense like compiling without optimization, spilling those variables to the stack and presumably reloading them twice each, once for the `*` and once for the `+`. That hardly seems worth considering; they're local temporaries that only live for one loop iteration. But yes, if a braindead compiler did spill/reload them, that would cost extra load-port bandwidth. – Peter Cordes Jan 12 '23 at 11:41
  • Should we consider the bytes transferred from registers or Cache? Or only the bytes transferred from the Main memory will be considered? What additional information do we need to calculate the maximum floating-point throughput? – Jeet Jan 12 '23 at 12:33
  • There are multiple ways to define computational intensity; either as ALU work per byte transferred between RAM and L1d cache, or as ALU work per load or store uop in the compiled asm if you're worried about back-end execution units. ALU work per byte loaded/stored only makes sense if all of them are equally likely to miss in cache, e.g. a store/reload of an 8-byte `double` local var is no more expensive than a store/reload of a 4-byte `float` (because CPUs have load/store units that are typically 32 or 64 **bytes** wide), but looping through an *array* of `double` touches twice as much memory. – Peter Cordes Jan 12 '23 at 12:40
  • Additional info for peak FP throughput: you need to know if the code vectorizes with SIMD (SSE or AVX for x86, or even AVX-512, to do 16 bytes of work at a time). And you need to know if it was able to use FMA instructions to contract `x + y*z` into one operation. And you need the theoretical max throughputs of those operations on your CPU, e.g. from https://uops.info/ for various modern x86 CPUs. e.g. Skylake has two 256-bit wide FMA units (8 floats at once), fully pipelined, so it can start 16 FMA, MUL, ADD, or SUB operations per clock cycle. – Peter Cordes Jan 12 '23 at 12:43
  • But Skylake can also only load 2 vectors and store 1 per clock. (Assuming hits in L1d cache). Alder Lake can do 3 loads and 2 stores per clock cycle, Ice Lake 2 and 2. (This is on 256-bit vectors, or scalars.) Look at compiler output and count uops (https://agner.org/optimize/), or use https://uica.uops.info/ to analyze the asm for theoretical max throughput assuming everything hits in L1d cache. – Peter Cordes Jan 12 '23 at 12:45
  • See also [What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?](https://stackoverflow.com/q/51607391) for more about static performance analysis (of asm, since that's what CPUs actually run). – Peter Cordes Jan 12 '23 at 13:39

0 Answers