How could I do loop unrolling for this function?

Question

I am trying to do loop unrolling using parallel accumulators but I am confused by the dependencies and calculations. The result and result2 should ideally run independent of each other in order to make use of CPU microarchitecture pipelined design, so that they can be executed in parallel. Meaning the following code should not be used:

for (i = degree - 1; i >= 1; i-=2)
{
    result = a[i] + x * result;
    result2 = a[i-1] + x * result; //same dependency
}

Original function:

double polyh(double a[], double x, long degree)
{
    long i;
    double result = a[degree];
    for (i = degree - 1; i >= 0; i--)
    {
        result = a[i] + x * result;
    }
    return result;
}

INTO:

double poly_opt(double a[], double x, long degree)
{
    long i;
    double result = a[degree];
    double result_2 = 0;
    double result_array[2] = {result, result_2};

    double xpwr_1 = x;     // 1
    double xpwr_2 = x * x; // 2
    double xpwr_array[2] = {xpwr_1, xpwr_2};
    for (i = degree - 1; i >= 1; i -= 2)
    {
        result = a[i] + xpwr_1 * result;
        result_2 = a[i - 1] + xpwr_2 * result_2;

        xpwr_1 = xpwr_1 * x * x;
        xpwr_2 = xpwr_2 * x * x;
    }

    // leftover when input not multiple of loop unrolling factor
    for (; i >= 0; --i)
    {
        result = a[i] + xpwr_1 * result;
        xpwr_1 = x * xpwr_1;
    }
    return result * result_2;
}

I am trying to introduce a 2nd variable result2 to do a loop unrolling of 2 but I was unable to get the result as the code above does not work.

This was an attempt without using parallel accumulators and loop unrolling factor of 8 which works slightly but I would like introduce new variables in hopes of speeding it up:

double poly_opt(double a[], double x, long degree)
{
    long i;
    double result = a[degree];
    for (i = degree - 1; i >= 8; i -= 9)
    {
        result = a[i - 8] + (a[i - 7] + (a[i - 6] + (a[i - 5] + (a[i - 4] + (a[i - 3] + ((a[i - 2] + (a[i - 1] + (a[i] + x * result) * x) * x) * x)) * x) * x) * x) * x) * x;
    }

    // leftover when input not multiple of loop unrolling factor
    for (; i >= 0; --i)
    {
        result = a[i] + x * result;
    }

    return result;
}

I am testing with these data sets:

double a[] = {1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0, 3.0};
double x = 0.5;
long degree = 359;

int main()
{
    std::cout << polyh(a, x, degree) << std::endl;
    return 0;
}

The correct output should be: 3.1428571429

"I am trying to introduce a 2nd variable result2 to do a loop unrolling of 2 but I was unable to get the result." Please post a [Minimal, Reproducible Example](https://stackoverflow.com/help/minimal-reproducible-example). — MikeCAT, Mar 06 '23 at 13:15
with `for (i = degree - 1; i >= 0; i--)` the last iteration is with `i=0`, what does `a[i - 1]` access? — mch, Mar 06 '23 at 13:23
It's undefined behaviour, but can you just post the code you use, and not fiddle with it? — Weather Vane, Mar 06 '23 at 13:23
Semi-related: [Why does this code execute more slowly after strength-reducing multiplications to loop-carried additions?](https://stackoverflow.com/q/72306573) for unrolling with multiple dep chains with a non-trivial recurrence. (evaluating `poly(i+1)` in terms of `poly(i)` instead of independently from scratch, for 2nd-order polynomials. See my answer for working code). Also related re: perf analysis of a rolled-up `xpwr` loop: [Latency bounds and throughput bounds for processors for operations that must occur in sequence](https://stackoverflow.com/q/63095394) — Peter Cordes, Mar 06 '23 at 14:08
Test with a smaller input, much lower degree like 3 to 4 coefficients, so you can explore the off-by-one cases in the first or second loop iteration when you single-step. Also, an `x` value like `10.0` might make it easy to see which terms of the polynomial got calculated correctly, with 1-digit coefficients each term will give the value of one decimal digit of the result. Or maybe x=100 so even if it's wrong, it's maybe not disturbing the neighbouring digits as much. — Peter Cordes, Mar 06 '23 at 14:11
Dumb question: Did you profile the code to determine if this is even your biggest slow-down? — Casey, Mar 23 '23 at 21:16
@Casey: The data dependency is the bottleneck for a large polynomial evaluated using Horner's rule (the original `polyh`), especially on a machine without FMA so the critical path latency of the loop-carried dependency chain is `addsd` + `mulsd` latency. e.g. 8 cycles total on Haswell (like the machine discussed in the CS:APP exercises this code is from, see links in my earlier comment). During those 8 cycles, it has the throughput to start 8 of each op instead of 1 of each. Estrin's Scheme can create some more ILP (instruction level parallelism) with minor extra overhead vs. OP's strategy. — Peter Cordes, Mar 24 '23 at 03:10
`xpwr_2 = xpwr_2 * x * x;` is doomed to inefficiency if you compile with strict FP math (not `-ffast-math`.) It has to evaluate `(xpwr_2 * x) * x;` due to order of operations; FP math isn't strictly associative due to rounding errors, so the compiler can't optimize it to one multiply with a loop-invariant x^2 unless you write it as `xpwr_2 *= x * x;` or `xpwr_2 = xpwr_2 * (x * x);` if you insist on not using `*=`. Or since you strongly want this optimization, [CSE](https://en.wikipedia.org/wiki/Common_subexpression_elimination) and hoist that expression yourself with `double x2 = x*x` — Peter Cordes, Mar 24 '23 at 14:03

How could I do loop unrolling for this function?

0 Answers0