Why (n mod const) is faster than (const mod n)?

Question

While playing with jmh I came across a weird thing I cannot explain.

@BenchmarkMode(Mode.SingleShotTime)
@Measurement(iterations = 10, batchSize = Integer.MAX_VALUE)
@Warmup(iterations = 5, batchSize = Integer.MAX_VALUE)
@State(Scope.Thread)
public class Tests {
    private int value;

    @Setup(Level.Iteration)
    public void setUp() {
        value = 1230;
    }

    @Benchmark
    public boolean testConstModN() {
        return 12345 % value == 0;
    }

    @Benchmark
    public boolean testNModConst() {
       return value % 12345 == 0;
   }
}

The results are below

Benchmark                  Mode  Cnt   Score   Error  Units
Tests.testConstModN          ss   10  10.789 ± 0.305   s/op
Tests.testNModConst          ss   10   7.550 ± 0.067   s/op

I am running on JDK 1.8.0_101, VM 25.101-b13, Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz (family: 0x6, model: 0x3c, stepping: 0x3)

If I set the const equal to the value or if I set the value to 0xffffffff, nothing changes.

Try to run a CONST % number while they are equals and show me the results please :) — Erik, Sep 06 '16 at 19:10
What happens if you set `value` to 12345 and use 1230 as your constant? — GriffeyDog, Sep 06 '16 at 19:12
I added the additional information to the question. It does not matter what the values of `const` and `value` are. — Slonopotamus, Sep 06 '16 at 19:23
If you're on Linux, try using `-prof perfasm` to dump the actual assembly generated for the two functions. I suspect that `testNModConst` has been optimized to replace the division with a multiplication. — Tavian Barnes, Sep 06 '16 at 19:27
If you do your tests in the other order, does that change? Also, see the Constant Folding section in http://java-performance.info/jmh/: `value` is the same every time, and the JVM may notice this and opportunistically do a `if (value == 12345) return the_answer` as the first operation, or something like that. That doc suggests techniques to avoid this. (IDK, I haven't used JMH, I'm seeing this question because of the x86-64 tag. If you can show the actual JIT-compiled asm that runs for each version, I can tell you how many cycles it should take on your Haswell CPU.) — Peter Cordes, Sep 06 '16 at 19:33
A related question: why can I work out (n mod 1000000) faster than I can work out (1000000 mod n)? This doesn't exactly address your question, but perhaps it shows that you might not necessarily expect an operation to still be as easy if you switch the operands. — mwfearnley, Sep 06 '16 at 19:37

Marko Topolnik · Accepted Answer · 2016-09-06T19:36:09.150

7

The CPU's DIV and MOD instructions are very expensive, costing 50 cycles or more. When you use a variable divisor, using these instructions is unavoidable. However, when you use a constant divisor, this can be compiled into a short sequence of much cheaper instructions such as MUL, ADD, and SHR.

Hacker's delight, chapter 10 covers several algorithms that solve this problem.

edited Sep 06 '16 at 19:36

answered Sep 06 '16 at 19:25

Marko Topolnik

195,646
29
319
436

Specifically the following strength reduction is probably at play: https://gmplib.org/~tege/divcnst-pldi94.pdf – Tavian Barnes Sep 06 '16 at 19:26
More likely a modular multiplicative inverse (multiply by a big magic constant, and use the upper half) – Peter Cordes Sep 06 '16 at 19:28
This can be resolved by inspecting the actual machine code emitted from the JIT compiler. I did it once before, maybe I can find that answer... – Marko Topolnik Sep 06 '16 at 19:29
2

Here it is: http://stackoverflow.com/questions/31372565/large-performance-gap-between-cpus-div-instruction-and-hotspots-jit-code/31378410#31378410 – Marko Topolnik Sep 06 '16 at 19:32
2

My favourite treatment is http://ridiculousfish.com/blog/posts/labor-of-division-episode-iii.html – Tavian Barnes Sep 06 '16 at 19:38
You can see what a C compiler does with the two cases for the OP's constant of 12345 [on the Godbolt compiler explorer](https://godbolt.org/g/SNRZkR). (clang3.8.1 -O3 -mtune=haswell). gcc does essentially the same thing with the same multiplicative modular inverse, but uses the implicit-operand form of (`imul r32`) to produce the high half of the product in edx, instead of `imul r64, r64, imm32` + shift to get the high half of the product in eax. – Peter Cordes Sep 06 '16 at 19:50

score 1 · Answer 2 · answered Sep 06 '16 at 19:11

1

Beware this answer it's only intuition

It's because of the nature of the % operator

In testNModConst() 1230 is less than 12345, so modulus only needs to internally check their size and realize that 12345 is bigger (one operation)

In testConstModN() 12345 is greater than 1230, so modulus will have to do a series of operations (subtraction) to calculate the answer.

It depends on the size of the integer relative to the constant

answered Sep 06 '16 at 19:11

cjds

269
1
3
12

Good explanation. Would be interesting to see the results in a native C or assembly environment. – Beethoven Sep 06 '16 at 19:14
1

Good guess. Unfortunately, it does not depend on the values of n and const. – Slonopotamus Sep 06 '16 at 19:25
2

This answer does not really reflect how modulo operations are performed on a hardware level. – Tavian Barnes Sep 06 '16 at 19:31

Why (n mod const) is faster than (const mod n)?

2 Answers2