
I have the results of some experiments and need to write about the efficiency of pipelining.

I don't have the source code, but I do have the times it took a 4-layer pipeline and an 8-layer pipeline to sum an array of 100,000,000 doubles.

The sum was performed as follows.

For the 4-layer pipeline:

// a[] holds N = 100,000,000 doubles; c receives the final sum.
double d0 = 0.0, d1 = 0.0, d2 = 0.0, d3 = 0.0;
for (int i = 0; i < N; i = i + 4) {
    d0 = d0 + a[i + 0];
    d1 = d1 + a[i + 1];
    d2 = d2 + a[i + 2];
    d3 = d3 + a[i + 3];
}
c = d0 + d1 + d2 + d3;

For the 8-layer pipeline:

double d0 = 0.0, d1 = 0.0, d2 = 0.0, d3 = 0.0;
double d4 = 0.0, d5 = 0.0, d6 = 0.0, d7 = 0.0;
for (int i = 0; i < N; i = i + 8) {
    d0 = d0 + a[i + 0];
    d1 = d1 + a[i + 1];
    d2 = d2 + a[i + 2];
    d3 = d3 + a[i + 3];
    d4 = d4 + a[i + 4];
    d5 = d5 + a[i + 5];
    d6 = d6 + a[i + 6];
    d7 = d7 + a[i + 7];
}
c = d0 + d1 + d2 + d3 + d4 + d5 + d6 + d7;
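
The no-pipeline and 2-layer versions, which I mention below, are presumably analogous, along these lines (my reconstruction, not the actual code):

// No pipeline: one accumulator, a single long dependency chain.
double d = 0.0;
for (int i = 0; i < N; i = i + 1) {
    d = d + a[i];
}
c = d;

// 2-layer pipeline: two independent accumulators.
double d0 = 0.0, d1 = 0.0;
for (int i = 0; i < N; i = i + 2) {
    d0 = d0 + a[i + 0];
    d1 = d1 + a[i + 1];
}
c = d0 + d1;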

My results show the times for no pipeline, a 2-layer pipeline, a 4-layer pipeline, and an 8-layer pipeline. The times are averaged over 10 runs and are as follows (a sketch of the kind of timing harness that could have produced them appears after the list). The experiment was run on an Intel Core i7-9750H processor.

  1. No pipeline: 0.106 secs
  2. 2-layer pipeline: 0.064 secs
  3. 4-layer pipeline: 0.046 secs
  4. 8-layer pipeline: 0.048 secs
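
For reference, here is a minimal sketch of the kind of harness that could have produced these numbers. It is my reconstruction, not my professor's code; the use of clock_gettime, the fill value of 1.0, and the overall structure are all assumptions.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 100000000
#define RUNS 10

int main(void) {
    double *a = malloc(N * sizeof *a);
    if (!a) return 1;
    for (int i = 0; i < N; i++) a[i] = 1.0;  /* assumed fill value */

    double total = 0.0;
    for (int run = 0; run < RUNS; run++) {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);

        /* 4-layer version; the other unroll factors are timed the same way */
        double d0 = 0.0, d1 = 0.0, d2 = 0.0, d3 = 0.0;
        for (int i = 0; i < N; i = i + 4) {
            d0 = d0 + a[i + 0];
            d1 = d1 + a[i + 1];
            d2 = d2 + a[i + 2];
            d3 = d3 + a[i + 3];
        }
        double c = d0 + d1 + d2 + d3;

        clock_gettime(CLOCK_MONOTONIC, &t1);
        total += (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        printf("run %d: sum = %f\n", run, c);  /* printing keeps c live */
    }
    printf("avg: %f secs\n", total / RUNS);
    free(a);
    return 0;
}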

It is evident that from no pipeline to the 4-layer pipeline the efficiency improves, but I'm trying to work out why it actually got worse from the 4-layer pipeline to the 8-layer pipeline. Considering that the partial sums are accumulated in different registers, there shouldn't be any dependency hazard affecting the values. One idea I had is that maybe there aren't enough ALUs to process more than 4 floating-point numbers at a time, causing stalls, but then it should still at least match the 4-layer version. I have plotted the process in Excel to try to find where the stalls/bubbles happen, but I can't see any. A rough latency model I tried is sketched below.
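
Here is the back-of-the-envelope model behind that reasoning, as a sketch. The 4-cycle FP add latency, 2 adds per clock of throughput, and 4 GHz clock are assumptions, not measurements.

/* build: gcc model.c -lm (hypothetical file name) */
#include <math.h>
#include <stdio.h>

int main(void) {
    const double N = 1e8;    /* array elements */
    const double LAT = 4.0;  /* assumed FP add latency in cycles */
    const double THRU = 2.0; /* assumed FP adds started per cycle */
    const double HZ = 4e9;   /* assumed clock speed */
    for (int unroll = 1; unroll <= 8; unroll *= 2) {
        /* Each accumulator's chain does N/unroll dependent adds, LAT cycles
           apart; the chains overlap, so time is one chain's latency, but it
           can never beat the N/THRU throughput floor. */
        double cycles = fmax((N / unroll) * LAT, N / THRU);
        printf("unroll %d: predicted %.4f secs\n", unroll, cycles / HZ);
    }
    return 0;
}

Under this model, an unroll of 8 should be no slower than an unroll of 4, which is exactly the part that doesn't match my measurements.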

  • What compiler and options are you using? Your CPU has FP add latency of 4 cycles, throughput of 2/clock (for scalar or for SIMD 2 or 4-wide), so 8 (vector) accumulators is barely enough to hide FP latency. (See [Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators)](https://stackoverflow.com/q/45113527) re: the related problem of FP dot product). You should be seeing a speedup with an unroll factor of 8, unless you disabled optimization, even if the compiler rolled back up into SIMD vectors. – Peter Cordes Jul 14 '21 at 02:38
  • But with optimization disabled, store forwarding throughput (1 store per clock) will bottleneck your performance at 4 accumulators, so that's probably what you did. – Peter Cordes Jul 14 '21 at 02:39
  • I don't really have access to the code itself since it was given to us by our professor. But I don't think we need to consider most of those things when talking about the results. We did go over loop unrolling but I'm not sure how that would work in here. – Sam Jul 14 '21 at 02:52
  • Because basically this code is not really the code that is run, but rather code meant to model how many operations the CPU performs. For example, the add operations for the 8-layer pipeline would be 6 + 9N/8, but for the 4-layer pipeline they would be 3 + 5N/4. – Sam Jul 14 '21 at 03:02
  • CPUs run machine code, not C. Also, you can't just count add operations, that makes no sense when the bottleneck is latency of the loop-carried dependency chain through each `d0..dN`, not throughput. It also makes no sense to consider integer add the same as FP add, because that CPU has an extra execution port with an integer ALU. (Similar setup to Haswell, https://www.realworldtech.com/haswell-cpu/4/, but with FP add on port 0 and port 1: [Why does Intel's Haswell chip allow floating point multiplication to be twice as fast as addition?](https://electronics.stackexchange.com/a/452366)) – Peter Cordes Jul 14 '21 at 03:35
  • An unroll factor of 8 is just barely enough to hide FP latency, and more is better, on that specific CPU model, assuming you're not introducing other bottlenecks. Core i7-9750H is a 4-wide superscalar CPU with aggressive out-of-order execution, so it's not something you can put in Excel and look at how instructions will go through the pipeline. Register renaming avoids all hazards except true dependencies (RAW). – Peter Cordes Jul 14 '21 at 03:37
  • But most importantly, `d0..dN` *aren't* kept in registers if you don't compile with optimization, and that's very likely what's happening. e.g. https://godbolt.org/z/MMWqdaj1E shows `gcc -O2` (efficient scalar, but not auto-vectorized) vs. `gcc -O0` output (garbage spilling every C variable between C statements, which would explain the plateau at an unroll of 4, given FP latency = 4 and 1/clock store throughput) – Peter Cordes Jul 14 '21 at 03:38
  • So yeah, *Considering that the sum is done by different registers* - that's most likely your false assumption. Doing it by store/reload to memory would fully explain your results. The 0.046 vs. 0.048 is probably just measurement noise, or else some secondary or tertiary effect. It's likely not [Adding a redundant assignment speeds up code when compiled without optimization](https://stackoverflow.com/q/49189685) since unroll by 8 gives it *more* time to store-forward. – Peter Cordes Jul 14 '21 at 03:41
  • Hmm, but store-forwarding would add latency to the dep chain, so actually unrolling more than 4 would still help, you wouldn't be hitting the 1/clock store bottleneck. Without details on how you compile (or a listing / disassembly of the loop you actually benchmarked), it's still not answerable / not a [mcve], though. – Peter Cordes Jul 14 '21 at 03:45
  • Wow, thank you for your replies. Having considered everything, I do think that, with only the information given, it would be really hard to make assumptions about the expected time. Nevertheless, your replies did help me see the difference between the theoretical pipeline improvement and the real improvement. Thank you for your help! – Sam Jul 14 '21 at 03:52
  • Even your best-case throughput is not great for that CPU: 100000000 doubles / 0.046 seconds / 4.5 GHz = 0.48 doubles per clock cycle (assuming max turbo). Even without using SIMD instructions, it can go 4x faster than that (starting two FP adds per clock). With fast enough RAM, that shouldn't have been a bottleneck: `8 bytes * 100000000 / 0.046 sec` = 17.4 GB/s, less than half the theoretical mem BW of 41.8 GB/s, so even a single-channel i7-9750H system could manage that. https://ark.intel.com/content/www/us/en/ark/products/191045/intel-core-i7-9750h-processor-12m-cache-up-to-4-50-ghz.html – Peter Cordes Jul 14 '21 at 03:53
  • (And yes, a single core can come close to maxing out the memory controllers on recent Intel "client" chips, unlike on many-core Xeon). If you compiled your 8x unroll with `gcc -O2`, it should run better than 1 add per clock cycle, probably close to 2. With `clang -O2` it might auto-vectorize and run much faster. Or just hit a memory bottleneck. – Peter Cordes Jul 14 '21 at 03:57
  • It's fairly straightforward to make a rough prediction of performance for a given asm loop by analyzing dep chains. [What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?](https://stackoverflow.com/q/51607391) / [How many CPU cycles are needed for each assembly instruction?](https://stackoverflow.com/a/44980899) explain more about how to do that for modern out-of-order exec CPUs. Or to get LLVM-MCA to do it for you for a compiler-generated asm loop. – Peter Cordes Jul 14 '21 at 03:58
  • BTW, thinking about this some more and looking at the asm, probably you bottleneck on *front-end* throughput with un-optimized builds, since GCC uses 10 (single-uop) instructions per statement, not one. (Including a separate `cdqe` sign-extension of `i` after a dword load, since it's not pointer-width, and a bunch of other dumb stuff.) And the front-end is only 4-wide, so that's slow enough to hide the latency bottleneck even with store-forwarding once you get to an unroll of 4. – Peter Cordes Jul 14 '21 at 23:40
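
Following up on those comments about unoptimized builds: a quick way to check whether d0..dN actually stay in registers is to compile a small kernel both ways and inspect the assembly. A sketch, assuming GCC (the file name sum4.c is hypothetical; the -O0 behavior noted in the comments is what to look for):

/* sum4.c (hypothetical name) -- compare code-gen with and without optimization:
 *   gcc -O0 -S sum4.c   # each "d0 = d0 + ..." becomes a load/add/store
 *                       # round trip through d0's stack slot
 *   gcc -O2 -S sum4.c   # d0..d3 stay in XMM registers for the whole loop
 */
double sum4(const double *a, long n) {
    double d0 = 0.0, d1 = 0.0, d2 = 0.0, d3 = 0.0;
    for (long i = 0; i < n; i += 4) {
        d0 = d0 + a[i + 0];
        d1 = d1 + a[i + 1];
        d2 = d2 + a[i + 2];
        d3 = d3 + a[i + 3];
    }
    return d0 + d1 + d2 + d3;
}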
