I'm working on Practice Problem 5.5 in the textbook:

Randal E. Bryant and David R. O’Hallaron, Computer Systems: A Programmer’s Perspective, 3rd ed. (Pearson, 2016).

Here is the problem description: [image]

In part B, my idea is that the CPE for this function is 8, because the performance-limiting computation is the update of `result`, which requires a floating-point multiply followed by a floating-point add. However, the solution says the CPE is 5, limited by `xpwr = x * xpwr`.

Can someone help explain what's wrong with my reasoning?

Figure 5.12: [image]

This is the data-flow diagram I drew: [image]

HanayoZz
  • Again, more info is needed. The description is _barely_ readable (especially the floating-point numbers). But: fmul (lat=5, issue=1, cap=2) and fadd (lat=3, issue=1, cap=1). Only _one_ fadd in parallel, but _two_ fmul? Weird; usually there are more add units than mul units. Probably cherry-picked because of part A. Since two muls can run in parallel, we only need to count one. So I'd say an initial latency of 5 for the mul, then the issue time overlaps with the latency of the second mul, so 5 instead of 6? And the add is done in parallel, overlapped with the mul? So 5 dominates. – Craig Estey Jan 03 '23 at 02:59
  • @CraigEstey Thanks for the explanation! I guess I can now rephrase my question more specifically: why can the add be done in parallel with the mul? It looks like the variable `result` needs the product from the mul before the add can start. – HanayoZz Jan 03 '23 at 03:10
  • My guess: the _first_ add might have to wait for the mul, but subsequent ones overlap (in the independent add unit). So the steady state (pipeline fully up and running) would be 5 for the mul. The add is lat=3 + issue=1 --> 4, and that 4 is less than the 5 for the steady-state mul. – Craig Estey Jan 03 '23 at 03:15
  • The answer to 5.5 B says the performance-limiting computation is `xpwr = x * xpwr`. It also mentions that updating `result` (the variable in the loop) only requires one fadd (latency 3) between iterations. I was confused about why only 3 cycles are needed. – HanayoZz Jan 03 '23 at 03:24
  • @CraigEstey can you explain a bit more why the later ones can be overlapped? – HanayoZz Jan 03 '23 at 03:31
  • @CraigEstey: This pipeline is modeled after Haswell: multiplies can run on either FMA unit, while there's a dedicated lower-latency `vaddps` / `addss` unit on only one port. ([Why does Intel's Haswell chip allow floating point multiplication to be twice as fast as addition?](https://electronics.stackexchange.com/a/452366)) Intel changed that with Skylake, and later re-added lower-latency FP-add units in Alder Lake with 2/clock throughput. See also the linked duplicates for a detailed explanation of the same textbook question: why it is 5 instead of 8 cycles per iteration. – Peter Cordes Jan 03 '23 at 04:00
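
The reasoning in the comments can be checked with a small model. This is only a sketch under stated assumptions: the body of `poly` is reconstructed from the two statements quoted in the question (the actual listing is in an image that is not shown here), the latencies are the ones quoted in the first comment (5 cycles for an FP multiply, 3 for an FP add), and `finish_cycle` is a hypothetical helper that ignores issue time and the number of functional units and only tracks when each value becomes ready.

```c
/* Latencies quoted in the comments (assumed): FP multiply = 5, FP add = 3. */
enum { MUL_LAT = 5, ADD_LAT = 3 };

/* Assumed shape of the loop from the problem; reconstructed, not copied. */
double poly(const double a[], double x, long degree)
{
    double result = a[0];
    double xpwr = x;
    for (long i = 1; i <= degree; i++) {
        result += a[i] * xpwr; /* the mul uses xpwr; only the add depends on the old result */
        xpwr = x * xpwr;       /* the mul depends on the old xpwr: loop-carried */
    }
    return result;
}

/* Hypothetical helper: cycle at which the final result is ready,
 * assuming unlimited functional units and only data dependencies. */
long finish_cycle(long degree)
{
    long xpwr_ready = 0;   /* cycle at which the current xpwr is available */
    long result_ready = 0; /* cycle at which the current result is available */
    for (long i = 1; i <= degree; i++) {
        long prod_ready = xpwr_ready + MUL_LAT; /* a[i] * xpwr */
        long add_start = prod_ready > result_ready ? prod_ready : result_ready;
        result_ready = add_start + ADD_LAT;     /* result += product */
        xpwr_ready = xpwr_ready + MUL_LAT;      /* xpwr = x * xpwr */
    }
    return result_ready;
}
```

Under this model `finish_cycle(n)` is 5n + 3: the first iteration does cost 8 cycles (one multiply followed by one add), but each later iteration adds only 5. The multiply `a[i] * xpwr` does not depend on the previous `result`, so it is not on the loop-carried chain for `result`; only the 3-cycle add is. The chain through `xpwr` carries a 5-cycle multiply every iteration, and that longer chain sets the CPE.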