
Sometimes associativity can be used to loosen data dependencies, and I was curious how much it can help. I was rather surprised to find that I can get a speed-up factor of nearly 4 by manually unrolling a trivial loop, both in Java (build 1.7.0_51-b13) and in C (gcc 4.4.3).

So either I'm doing something pretty stupid or the compilers ignore a powerful tool. I started with

int a = 0;
for (int i=0; i<N; ++i) a = M1 * a + t[i];

which computes something close to String.hashCode() (set M1=31 and use a char[]). The computation is pretty trivial and for t.length=1000 it takes about 1.2 microseconds on my i5-2400 @ 3.10GHz (both in Java and C).
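For reference, here is the loop above as a minimal, self-contained Java program (the class and method names are mine, not from the original benchmark):

```java
// Baseline polynomial hash loop; M1 = 31 matches String.hashCode.
class HashLoop {
    static final int M1 = 31;

    static int hash(int[] t) {
        int a = 0;
        for (int i = 0; i < t.length; ++i) {
            a = M1 * a + t[i]; // each iteration depends on the previous a
        }
        return a;
    }

    public static void main(String[] args) {
        int[] t = {1, 2, 3};
        System.out.println(HashLoop.hash(t)); // 31*31*1 + 31*2 + 3 = 1026
    }
}
```

Note the serial dependency: every iteration needs the previous value of `a` before it can start its multiplication.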

Observe that every two steps, a gets multiplied by M2 = M1*M1 and something gets added. This leads to this piece of code:

int a = 0;
int i = 0;
for (; i+1 < N; i += 2) {
    a = M2 * a + (M1 * t[i] + t[i+1]); // <-- note the parentheses!
}
if (i < N) a = M1 * a + t[i]; // Handle odd length.

This is exactly twice as fast as the first snippet. Strangely, leaving out the parentheses costs 20% of the speed-up. Funnily enough, the transformation can be repeated, and a factor of 3.8 can be achieved.
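Repeating the transformation once more gives a 4-way unroll. This is my own sketch of what that looks like, not the exact benchmark code:

```java
// 4-way unrolled polynomial hash: four input-only products per iteration
// (an independent, short dependency chain), one multiply on the chain through a.
class HashLoop4 {
    static final int M1 = 31, M2 = M1 * M1, M3 = M2 * M1, M4 = M2 * M2;

    static int hash(int[] t) {
        int a = 0;
        int i = 0;
        for (; i + 3 < t.length; i += 4) {
            a = M4 * a + (M3 * t[i] + M2 * t[i+1] + M1 * t[i+2] + t[i+3]);
        }
        for (; i < t.length; ++i) a = M1 * a + t[i]; // handle the remainder
        return a;
    }

    public static void main(String[] args) {
        System.out.println(HashLoop4.hash(new int[]{1, 2, 3, 4, 5})); // 986115
    }
}
```

The result is bit-for-bit identical to the naive loop, because int arithmetic is exact modulo 2^32.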

Unlike Java, gcc -O3 chooses not to unroll the loop. That's a wise choice, since unrolling alone wouldn't help anyway (as -funroll-all-loops shows).

So my question is: What prevents such an optimization?

Googling didn't help; all I found was "associative arrays" and "associative operators".

Update

I polished up my benchmark a little bit and can provide some results now. There's no speedup beyond unrolling 4 times, probably because of multiplication and addition together taking 4 cycles.
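The real benchmark used caliper; the following is only a crude nanoTime sketch of my own (assuming Java 8+ for the method references; the names mul1/mul2 are mine) that reproduces the effect:

```java
import java.util.function.ToIntFunction;

class Bench {
    static final int M1 = 31, M2 = M1 * M1;
    static volatile int sink; // keeps the JIT from discarding the results

    static int mul1(int[] t) { // naive loop
        int a = 0;
        for (int i = 0; i < t.length; ++i) a = M1 * a + t[i];
        return a;
    }

    static int mul2(int[] t) { // 2-way unrolled, regrouped
        int a = 0;
        int i = 0;
        for (; i + 1 < t.length; i += 2) a = M2 * a + (M1 * t[i] + t[i + 1]);
        if (i < t.length) a = M1 * a + t[i];
        return a;
    }

    static long best(ToIntFunction<int[]> f, int[] t, int reps) {
        long best = Long.MAX_VALUE;
        for (int r = 0; r < reps; ++r) {
            long start = System.nanoTime();
            sink = f.applyAsInt(t);
            best = Math.min(best, System.nanoTime() - start);
        }
        return best;
    }

    public static void main(String[] args) {
        int[] t = new int[1000];
        for (int i = 0; i < t.length; ++i) t[i] = i * 31 + 7; // arbitrary data
        best(Bench::mul1, t, 200_000); // warm-up so the JIT compiles both
        best(Bench::mul2, t, 200_000);
        System.out.println("mul1: " + best(Bench::mul1, t, 200_000) + " ns");
        System.out.println("mul2: " + best(Bench::mul2, t, 200_000) + " ns");
    }
}
```

Taking the best of many runs and consuming the result via a volatile sink avoids the worst microbenchmark pitfalls, but caliper (with each experiment in its own JVM) is still the more trustworthy setup.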

Update 2

As Java already unrolls the loop, all the hard work is done. What we get is something like

...pre-loop
for (int i=0; i<N; i+=2) {
    a2 = M1 * a + t[i];
    a = M1 * a2 + t[i+1];
}
...post-loop

where the interesting part can be rewritten as

    a = M1 * ((M1 * a) + t[i]) + t[i+1]; // latency 2mul + 2add

This reveals that there are 2 multiplications and 2 additions, all of them to be performed sequentially, thus needing 8 cycles on a modern x86 CPU (multiplication has a latency of about 3 cycles, addition of 1). All we need now is some primary school math. The regrouping works for ints even in case of overflow or whatever, but is not applicable to floating point.

    a = ((M1 * (M1 * a)) + (M1 * t[i])) + t[i+1]; // latency 2mul + 2add

So far we have gained nothing, but it allows us to fold the constants:

    a = ((M2 * a) + (M1 * t[i])) + t[i+1]; // latency 1mul + 2add

and gain even more by regrouping the sum

    a = (M2 * a) + ((M1 * t[i]) + t[i+1]); // latency 1mul + 1add
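To convince yourself that the regrouping is exact for ints (two's-complement arithmetic is associative and distributive modulo 2^32, unlike floating point), here is a quick check; the class and method names are hypothetical:

```java
import java.util.Random;

class RegroupCheck {
    static final int M1 = 31, M2 = M1 * M1;

    static int naive(int[] t) {
        int a = 0;
        for (int v : t) a = M1 * a + v;
        return a;
    }

    static int regrouped(int[] t) {
        int a = 0;
        int i = 0;
        for (; i + 1 < t.length; i += 2) {
            a = (M2 * a) + ((M1 * t[i]) + t[i+1]); // latency 1mul + 1add per pair
        }
        if (i < t.length) a = M1 * a + t[i]; // odd length
        return a;
    }

    public static void main(String[] args) {
        Random rnd = new Random(42);
        for (int n = 0; n < 20; ++n) { // all small lengths, random (overflowing) data
            int[] t = new int[n];
            for (int j = 0; j < n; ++j) t[j] = rnd.nextInt();
            if (naive(t) != regrouped(t)) throw new AssertionError("mismatch at n=" + n);
        }
        System.out.println("all lengths agree");
    }
}
```

With floats this test would fail: rounding makes `(x + y) + z` and `x + (y + z)` differ in general, which is why compilers may not regroup floating-point expressions without flags like gcc's -ffast-math.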
maaartinus
    "Strangely, leaving out the parentheses eats 20% of the speed-up." If you mean the `{` and `}` of the `for` loop, this is quite impossible: it compiles to the same code either way; which casts doubt on your testing methodology. If you mean the `(` and `)` in the expression inside the loop, omitting them changes the meaning of the expression, so you are comparing apples and oranges. – user207421 Feb 13 '14 at 05:17
    I marked them now. No, it doesn't, it's like `x + (y + z)` with `x = M2 * a`, `y = M1 * t[i]`, and `z = t[i+1]` (unless I'm completely blind). The problem is that it lengthens the dependency chain containing `a` and the compiler doesn't care. – maaartinus Feb 13 '14 at 05:28
  • I don't think anything specifically prevents the optimization; they just haven't implemented it yet. – Boann Feb 13 '14 at 10:49
  • The optimizer first and foremost looks for common subexpressions it can pull out of the loop. Your code has none. Beyond that, since loop unrolling can cause a code size explosion, some JITC implementations will delay unrolling the loop until it has proved to be a "hot spot" (and C compilers will similarly avoid unrolling unless told to use super-high optimization). – Hot Licks Feb 13 '14 at 18:27
  • (From an optimization point of view, the expression in your loop is pretty ugly.) – Hot Licks Feb 13 '14 at 18:28
  • @HotLicks: But the HotSpot *did* unroll the loop, but without any gain. And so did gcc with `-O3 -funroll-all-loops`, but again, no gain. And yes, there's nothing to be moved out of the loop, but you should complain to the author of `String.hashCode` which has inspired me to do this. ;) – maaartinus Feb 13 '14 at 18:34
  • Keep in mind that optimizers are constrained to preserve side-effects, and sometimes it's hard to predict what may be a side-effect. Eg, while it's obvious to you and me, it may not be obvious to the compiler that changing `a` does not alter the array `t`. – Hot Licks Feb 13 '14 at 18:41
  • (It's also true that the Hotspot JITC kinda sucks at optimization. Dunno about GCC.) – Hot Licks Feb 13 '14 at 18:43

1 Answer


Here is how I understand your two cases: In the first case, you have a loop that takes N steps; in the second case you manually merged two consecutive iterations of the first case into one, so you only need to do N/2 steps in the second case. Your second case runs faster and you are wondering why the dumb compiler couldn't do it automatically.

There is nothing that would prevent the compiler from doing such an optimization. But please note that this rewrite of the original loop leads to a larger executable: you have more instructions inside the for loop, plus an additional if after the loop.
If N=1 or N=3, the original loop is likely to be faster (less branching and better caching/prefetching/branch prediction). The rewrite made things faster in your case, but it may make things slower in others. It is not clear-cut whether this optimization is worth doing, and implementing it in a compiler can be highly nontrivial.

By the way, what you have done is very similar to loop vectorization, except that you did the parallel step manually and plugged in the result. Eric Brumer's Compiler Confidential talk will give you insight into why rewriting loops is tricky in general and what the drawbacks are (larger executable size, potentially slower code in some cases). So compiler writers are very well aware of this optimization possibility and are actively working on it, but it is highly nontrivial in general and can also make things slower.


Please try something for me:

int a = 0;
for (int i=0; i<N; ++i) 
  a = ((a<<5) - a) + t[i];

assuming M1=31. In principle, the compiler should be smart enough to rewrite 31*a as (a<<5)-a, but I am curious whether it really does.
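That strength reduction is an exact identity for ints, overflow included, since both sides are computed modulo 2^32. A quick sanity check (class name is mine):

```java
// Verify the strength reduction 31*a == (a<<5) - a for extreme int values.
class ShiftCheck {
    public static void main(String[] args) {
        int[] samples = {0, 1, -1, 31, 123456789,
                         Integer.MAX_VALUE, Integer.MIN_VALUE};
        for (int a : samples) {
            if (((a << 5) - a) != 31 * a) {
                throw new AssertionError("mismatch for a = " + a);
            }
        }
        System.out.println("identity holds"); // prints "identity holds"
    }
}
```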

Ali
  • Good points. For `M=32`, the [results are quite different](https://microbenchmarks.appspot.com/runs/2a0514a8-5f31-42aa-ae52-6e7ad0c4c08e). Obviously, the shift gets used; the maximum speedup factor is 2, as shift+add together take 2 cycles. What puzzles me is the slowdown for 3-way unrolling. – maaartinus Feb 13 '14 at 17:34
  • @maaartinus Oooops, that was a typo, sorry. It's now fixed. I meant `M1=31` and not `32`. – Ali Feb 13 '14 at 17:40
  • @maaartinus Please also try setting a bigger value for reps, say `50`, and please also reverse the order: `timeMul5(50), timeMul4(50), ... , timeMul1(50)`. I would like to know how stable are those timings under perturbations. – Ali Feb 13 '14 at 17:49
  • OK, both Java and gcc use the shift, but I can't see any gain there: 3 simple instructions instead of 1 `imul`. – maaartinus Feb 13 '14 at 17:51
  • @maaartinus OK, thanks, the shift stuff was only a side-note. What happens if you increase `reps` to 50 / you reverse the order of your tests? – Ali Feb 13 '14 at 17:55
  • Those `timeMul1(3)` you can see there are *not* used in the benchmark, it's just for a correctness test. The real number of repetitions is controlled by caliper and it uses some thousands or millions so that the timing can be reliable measured. The order doesn't actually exist as each experiment gets its own JVM. – maaartinus Feb 13 '14 at 17:56
  • The video you link is fantastic, but the complications mentioned there don't apply to my simple problem. I'll post it to the JDK compiler mailing list and then update my question. – maaartinus Feb 19 '14 at 06:23
  • @maaartinus OK, please let me know in comments. – Ali Feb 19 '14 at 13:24