Efficient Modulo 3 operation?

Question

Possible Duplicate:
Fast modulo 3 or division algorithm?

Everyone knows that modulo arithmetic can be a huge drawback on performance. Does anyone know of a good alternative for x%3 operations? I know that one exists for x%2, but I really need one for modulo 3 since I want to alternate between three buffers in a for loop.

Thanks!

I suggest you to first check where the bottleneck lies. I highly doubt it's the modulo... — m0skit0, Nov 15 '11 at 19:18
Everyone knows that attempting to out-optimize the compiler's modulo optimizations can be a huge drawback on developer efficiency. — recursive, Nov 15 '11 at 19:25
Unless your code bottlenecks at the modulo, it's going to be a case of premature optimization. — moshbear, Nov 15 '11 at 19:16

Voo · Accepted Answer · 2011-11-15T19:57:07.773

Well instead of the usual "measure it" stuff an actual answer - because that stuff is actually real fun math. Although the compiler could and probably does this as well (at least modern optimizing c++ compilers, javac certainly won't and I've got no idea if the JVM does this) - so better check if it isn't already doing the work for you.

But still fun to know the theory behind the optimization: I'll use assembly because we need the higher 32bit word of a multiplication. The following is from Warren's book on bit twiddling:

n is the input integer we want the modulo from:

li M, 0x55555556   ; load magical number (2^32 + 2) / 3
mulhs q, M, n      ; q = higher word of M * n; i.e. q = floor(M*n / 2^32)
shri t, n, 31      ; add 1 to q if it is negative
add q, q, t

Here q contains the divisor of n / 3 so we just compute the remainder as usual: r = n - q*3

The math is the interesting part - latex would be rather cool here:

q = Floor( (2^32+2)/ 3 * (n / 2^32) ) = Floor( n/3 + 2*n/(3*2^32) )

Now for n = 2^31-1 (largest n possible for signed 32bit integers) the error term is less than 1/3 (and non negative) which makes it quite easy to show that the result is indeed correct. For n = -2^31 we have the correction by 1 above and if you simplify that you'll see that the error term is always larger than -1/3 which means it holds for negative numbers as well.

I leave the proof with the error term bounds for the interested - it's not that hard.

The gcc genrate almost the same as you wrote: int f(int n) { return n / 3; 0: 89 f8 mov %edi,%eax 2: ba 56 55 55 55 mov $0x55555556,%edx 7: c1 ff 1f sar $0x1f,%edi a: f7 ea imul %edx c: 29 fa sub %edi,%edx } e: 89 d0 mov %edx,%eax 10: c3 retq — fghj, Nov 15 '11 at 20:15

Patrick Schlüter · Answer 2 · 2011-11-16T12:52:23.823

6

If it's in a straight loop, no need to calculate a modulo. Hold a second int var that you reset every 3 steps.

int i, bn = 0;

for(i=0; i<whatever; i++) {
  ...
  if(++bn == 3) bn = 0;
}

And that is not a premature optimisation, it's avoiding unecessary calculation.

EDIT: It was stated in OP that he was using a loop to switch between buffers, so my solution looks quite appropriate. As for the downvote, if it was a mistake, no problem.

edited Nov 16 '11 at 12:52

answered Nov 15 '11 at 19:55

Patrick Schlüter

11,394
1
43
48

1

@Stefan If it's really a loop this is almost certainly faster than ANY other possible solution on a modern x86 cpu. Why? Because the branch predictior in modern CPUs (that is for high performance CPUs.. x86, Power and so on - quite certainly not ARM) WILL get such a simple recurring pattern. Hence we replace my above calculation which still includes two imuls, shifts, an add, (or the other possibility with lots of adds, shifts,..) with a branch and a `mov reg, 0` instruction. Why do you think this would be slower? – Voo Nov 15 '11 at 20:10
Sorry my mistake. Shouldnt downvote while waiting for a deployment. I really thought he was calculating modulo through a loop. Can you edit answer a space orsomething so i can undo the downvote? It doesnt answer the exact question for efficient modulo but is probably the best answer for the actual need. – Stefan Nov 15 '11 at 20:17
You mean if it is in a loop and you need the modulo 3 of the loop index. Either way this is most definately premature optimization since it reduces code clarity and more importantly it isn't even guranteed to be an optimization, since the compiler might be able to understand the modulo better then the if and therefore make better code. Or it might not, who knows. If neither the modulo nor the if is optimized and you work on a platform with bad/no branchprdiction the modulo might even be faster due to not incurring pipeline stalls. – Grizzly Nov 15 '11 at 20:21
@Stefan I fear only if he edited the answer in a significant way (no idea what exactly SO defines as significant). But I'm sure a heartfelt sorry is all he will need :) The -2 points aren't that problematic – Voo Nov 15 '11 at 20:23
@Grizzly Even if the mod 3 IS optimized as in my answer it will be significantly (well significantly for a handful instructions - we're talking <30 cycles difference here) faster still to do it this way around. – Voo Nov 15 '11 at 20:24
@Voo In a relatively straight conversation of the instructions that might be true (btw: the compiler might also eliminate the branch and use either cmov or some tricky branchless usage for that too and if it doesn't I wouldn't be conviced which is faster, depends on where the latencies can be better hidden), but the more important point is that the compiler might understand one way to do that better then the other and therefore better optimize the surrounding code (particularly in the presence of auto vectorization, since we are obvoiusly talking about simple loops (otherwise why bother)). – Grizzly Nov 15 '11 at 20:31
@Grizzly I'm pretty sure cmov isn't a great idea on modern x86 CPUs, but I get your point (we also increase register pressure ;) ). Although I doubt that the code would make any problems with vectorization after it's unrolled. We'd have to check but I'd bet that this is actually the fastest possible code - since another user already said that gcc does my optimization it's probably the best solution apart from just using `n % 3` one could get ;) Actually I just thought of some nice additional solution, must try that – Voo Nov 15 '11 at 20:39
@Voo it _might_ be faster to start with `bn = 3` and test `if (--bn == 0) bn = 3`. – Daniel Fischer Nov 15 '11 at 21:18
@Grizzly it is not a premature optimisation, it's an algorthmic change. Replacing a (relatively heavy) algortihm by another a lighter one. As for which is easier to read, this is a matter of taste and experience. Especially in OPs case, he stated he wanted to loop through several buffers, IMO is my solution better for several reasons: it's faster, it's the same performance for any number of buffers and it is nearer to functionality asked for. – Patrick Schlüter Nov 16 '11 at 12:57
and the point of my answer was also to look at the problem from another point of view. Sometimes it can open a whole new venue ione hasn't thought. – Patrick Schlüter Nov 16 '11 at 13:02
Concerning your second comment, where you object with concerns about register pressure, cmov and vectorisation, that makes me actually laugh. Talk about premature optimisation, you're breaking a sweat to find solutions to problems which do not exist yet. If after the change, 1 add+1 cmp instead of 1 division is still a problem, then he can think about solutions – Patrick Schlüter Nov 16 '11 at 13:09
@tristopia: I'd say it makes the intent less clear (not much but still) so it is premature optimization. And what about about vectorization, cmov,... made you laugh? I simply wanted to expess that it is in no way guaranteed that the compiler will actually generate better (or at least not worse) code then for a straight i%3, so it's not more expressive and not necessarily faster which makes it premature optimization in my book. So without knowing quite a bit of information about the optimizations done by the compiler it doesn't make sense to say this vairant is generally faster then using i%3. – Grizzly Nov 16 '11 at 14:21

score 5 · Answer 3 · answered Nov 15 '11 at 19:19

5

If 3 is known at compile time, then the compiler will generate the 'tricks' to do it as efficiently as possible. Modulo takes much longer when the divisor is unknown until run-time.

answered Nov 15 '11 at 19:19

JohnPS

2,518
19
17

Efficient Modulo 3 operation?

3 Answers3

Linked