9

In the book Computer Systems: A Programmer's Perspective, Exercise 5.5 shows a piece of code that computes the value of a polynomial:

/* Evaluate a[0] + a[1]*x + a[2]*x^2 + ... + a[degree]*x^degree */
double poly(double a[], double x, int degree)
{
    long int i;
    double result = a[0];
    double xpwr = x;    /* holds x^i at the start of iteration i */
    for (i = 1; i <= degree; i++) {
        result += a[i] * xpwr;
        xpwr = x * xpwr;
    }
    return result;
}
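
For reference, a quick usage check (my own example, not from the book): evaluating 1 + 2x + 3x^2 at x = 2 gives 1 + 4 + 12 = 17.

double a[] = {1.0, 2.0, 3.0};
double r = poly(a, 2.0, 2);   /* r == 17.0 */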

The exercise assumes that double-precision floating-point addition and multiplication take 3 and 5 clock cycles respectively. The reader is asked to explain why the measured CPE (Cycles Per Element) value is 5.

According to the exercise's answer, each iteration must update the variables xpwr and result, which takes one floating-point addition (for result) and one floating-point multiplication (for xpwr); the latter dominates the latency, making the ultimate CPE 5.
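
In code terms, the book's answer treats the loop body as two separate loop-carried chains (the latency numbers are the exercise's assumptions):

for (i = 1; i <= degree; i++) {
    result += a[i] * xpwr;  /* chain 1: result -> result, one FP add (3 cycles) */
    xpwr = x * xpwr;        /* chain 2: xpwr -> xpwr, one FP mul (5 cycles)     */
}
/* a[i] * xpwr reads chain 2's old value but is not itself loop-carried:
   it feeds chain 1 through a single add each iteration. */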

But I think the data flow should be something like this:

xpwr               result
  |                  |
  +-----+ +--[load]  |
  |     | |          |
[mul]  [mul]         |
  |      |           |
  |      +---+ +-----+
  |          | |
  |         [add]
  |           |
  |           +------+
  |                  |
xpwr               result

So the longest path runs from the previous value of xpwr to the new value of result, through the execution units [mul] and [add]. Therefore the longest time should be 5 + 3 = 8 cycles.

I want to ask

  1. What is exactly the meaning of a critical path? And how to determine it?
  2. Which answer (mine or the book's) is more reasonable?

Any explanation of the CPU architecture, execution units, pipeline, or floating-point unit will be appreciated.

pjhades
  • It seems that a path consists of only the edges connecting the same loop register. With this definition, there are two paths in the data flow: the path of `xpwr`, which has only `mul`, and the path of `result`, which has `add`. So the critical path is the one of `xpwr`. – ashen Mar 09 '19 at 07:44
  • Related: a later question about the same code ([Latency bounds and throughput bounds for processors for operations that must occur in sequence](https://stackoverflow.com/q/63095394)), also asking why the latency bottleneck isn't 8 cycles per iteration. My answer there has a correct ASCII diagram of the two parallel dep chains across multiple iterations. – Peter Cordes May 13 '22 at 11:38
  • Oh bother, [computer-architecture] counts as a separate tag, so my [cpu-architecture] gold badge didn't let me close this as a duplicate. – Peter Cordes May 13 '22 at 12:11
  • @PeterCordes I added a cpu-architecture tag. It has been so long so feel free to close. Thank you for the link to the other question! – pjhades May 14 '22 at 18:12
  • The tags are already synonyms, that's why I thought my dup-hammer would work on it. Any edit (even to the non-tag part of the question) would have updated the tags to this. :P Anyway, thanks for finishing the dup-closure. – Peter Cordes May 14 '22 at 18:32

5 Answers

6

I know I'm a bit late to the party, but the book is absolutely right. As you can verify for yourself by timing the code, the CPE is indeed 5, so the second answer is wrong.

But the first one is also wrong. It says that the MULs must be performed at the same time, which is not at all possible on the Nehalem architecture (and, I suspect, on most modern processors). Remember that there is only one FP MUL unit and a separate FP ADD unit (as shown in the book, ed. 2011 and later).

What happens instead is this:

(LOADs are assumed always present, just 1 cycle if in cache)

First we feed xpwr *= x into MUL. Immediately after that we feed xpwr * a[i] (remember the pipeline!).

... after 5 cycles we'll get the new value of xpwr, and after 6 cycles we'll have the result of xpwr*a[i]. At that point, a new computation of xpwr *= x will be at stage 1 of MUL. So we have only 4 more cycles in which to do the rest of the ops if we don't want to be restricted by them.

Of course, that is easy as we only need 3 cycles for the FP ADD to get the new result.

So it becomes obvious that the limiting factor is the computation of xpwr. This means that in looking for the critical path (whatever that is) we have to look specifically at the paths from the old values to the new ones. In this case, the path for result consists of only one FP ADD! (That's what threw me off at first too.)
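
To spell out the steady state under the exercise's latencies (FP add 3 cycles, FP mul 5, one pipelined unit each), a sketch of one iteration's schedule:

    cycles 1-5 : xpwr = x * xpwr     (iteration i, in the MUL pipeline)
    cycles 2-6 : a[i] * old xpwr     (enters MUL one cycle later, pipelined)
    cycles 6-10: xpwr = x * xpwr     (iteration i+1, starts as soon as
                                      iteration i's xpwr is ready)
    cycles 7-9 : result += ...       (ADD unit, hidden under the MULs)

A new xpwr is produced every 5 cycles, so the loop settles at CPE = 5.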

  • This is the correct answer. For example, assume this: integer addition: 1 cycle latency, 1 cycle issue; double FP addition: 3 cycles latency, 1 cycle issue; double FP multiplication: 5 cycles latency, 1 cycle issue; load: 4 cycles latency, 1 cycle issue. – lukehsiao Mar 22 '15 at 00:31
  • You say *But the first one is also wrong.* - I don't see in the question where the book's answer implies starting two `mul` in the same cycle; the chart / graph with two `mul` in parallel is part of the querent's guess; they're not saying it's from the book. (Its shape is somewhat correct, but it shows one mul shorter than another. And it maybe implies starting two `mul` in the same cycle; only Haswell and later can do that.) My answer on [a later duplicate](https://stackoverflow.com/q/63095394) has a correct diagram for the same problem. – Peter Cordes May 13 '22 at 12:04
  • *(LOADs are assumed always present, just 1 cycle if in cache)* - No, load-use latency is about 5 or 6 cycles to an XMM register for a hit in L1d cache. The key is that the address is ready ahead of time: the loop-carried dep chain of `i++` is only integer, so it's only 1 cycle and can even be unrolled to less than 1 cycle per FP iteration. So the CPU can execute the integer loop overhead / pointer increment far into future iterations and get those loads started, so the data is ready by the time the other input for `mulsd` is ready (the output of a previous `mulsd`.) – Peter Cordes May 13 '22 at 12:09
2

The critical path is indeed 8 cycles, but the question asks for the CPE, which is the average time per iteration of the loop in steady state.

Except in the first and last iterations, the processor can do the addition from the previous iteration of the loop and the current multiplications at the same time, because the operands do not depend on each other. The first iteration of the loop takes the full 8 cycles, but every iteration after that takes only 5, making the measured CPE 5 cycles.

P.S. I do agree that the book's way of presenting the critical path is confusing. Their critical path is not simply the longest path through the graph; it is the longest chain of operations whose operands depend on results of previous iterations and that therefore must execute in order. This definition makes finding the critical path rather unintuitive.

Opundo
  • `xpwr = x * xpwr;` is the only statement that is not fully independent across iterations, so its 5-cycle latency is what prevents a computer from processing faster. (Of course, this assumes sufficient support in hardware to exploit the parallelism.) A decent compiler would be able to do better (at least for higher degree) by recognizing that the xpwr calculation can be parallelized, e.g. calculate x^2 in the first iteration, x*x^2 and x^2*x^2 (in parallel) in the second, x^3*x^2 and x^4*x^2 in the third, etc.; a sketch of this idea appears after these comments. xpwr is partially a *name* dependence. The book seems inaccurate. –  Feb 22 '14 at 17:15
  • The longest *loop-carried* dependency chain has 5-cycle latency. It's not very meaningful to say the first iteration takes 8 cycles, because out-of-order exec has already started on the independent part of the next iterations (the loads, pointer increment / loop overhead) well before we get to the end of even the first 5 cycle latency `mulsd`. And the `mulsd` in the next iteration overlaps with the multiply and add in this iteration, the CPU doesn't wait 8 cycles on the first iteration before doing any work in the 2nd. The key is loop carried vs. forking off independent work every iter. – Peter Cordes May 13 '22 at 11:58
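
A minimal sketch of the chain-splitting idea from the first comment above (hypothetical code, not from the book): step two powers per iteration with x2 = x*x, so each of the two multiply chains advances only once per two elements, cutting the latency bound from 5 to about 2.5 cycles per element. As with any floating-point reassociation, the result may differ in the last bits.

double poly2(double a[], double x, int degree)
{
    long int i;
    double result = a[0];
    double x2 = x * x;
    double xpwr1 = x;     /* x^1, x^3, x^5, ... */
    double xpwr2 = x2;    /* x^2, x^4, x^6, ... */
    for (i = 1; i + 1 <= degree; i += 2) {
        result += a[i] * xpwr1 + a[i + 1] * xpwr2;
        xpwr1 *= x2;      /* the two chains are independent and */
        xpwr2 *= x2;      /* can be in flight at the same time  */
    }
    if (i <= degree)      /* leftover element when degree is odd */
        result += a[i] * xpwr1;
    return result;
}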
1

A1: According to the book, the critical path is the longest path in the data-flow graph that loops back into a single register, rather than the sum of 'mul' and 'add', whose results are only intermediate operands for the next operations.

This becomes clear if you continue reading the rest of the chapter. In particular, comparing combine7's data-flow graph with combine5's is helpful.

A2: If A1 is understood, question 2 is clear: the answer in the book is the reasonable one.

Lin
1

In one iteration, there are three operations executed in parallel:

  • result += PREV, where PREV equals a[i] * xpwr calculated in the previous iteration, requiring a floating-point addition (3 clock cycles).
  • a[i] * xpwr, whose value will be used in the next iteration, requiring a floating-point multiplication (5 clock cycles).
  • xpwr * x, whose value will be used in the next iteration, requiring a floating-point multiplication (5 clock cycles).

As you can see, there are no data dependencies among the three operations, so on a superscalar out-of-order CPU they can execute in parallel, which results in a CPE of 5.

They don't necessarily start in the same clock cycle (and in fact they can't all because the CPU doesn't have 3 FP execution units), but one can start before another finishes, so their latencies can overlap. They're in flight at the same time, in the pipelined FP execution units.
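
Stated as a back-of-the-envelope bound with the exercise's latencies:

    CPE >= max(latency of each loop-carried chain)
         = max(xpwr chain: one 5-cycle mul, result chain: one 3-cycle add)
         = 5

The 8-cycle mul-then-add path crosses from the xpwr chain into the result chain only once per element; it is not loop-carried, so it does not bound the throughput.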

Peter Cordes
huibosa
-1

The critical path is the longest path through the graph, in this case eight clocks. This is what the Dragon Book has to say about critical paths (10.3.3 Prioritized Topological Orders):

Without resource constraints, the shortest schedule is given by the critical path, the longest path through the data-dependence graph. A metric useful as a priority function is the height of the node, which is the length of a longest path in the graph originating from the node.

I think you found an error in the book. You should consider contacting the authors, so that they can correct it in future printings.

Mackie Messer