6

While playing with jmh I came across a weird thing I cannot explain.

@BenchmarkMode(Mode.SingleShotTime)
@Measurement(iterations = 10, batchSize = Integer.MAX_VALUE)
@Warmup(iterations = 5, batchSize = Integer.MAX_VALUE)
@State(Scope.Thread)
public class Tests {
    private int value;

    @Setup(Level.Iteration)
    public void setUp() {
        value = 1230;
    }

    @Benchmark
    public boolean testConstModN() {
        return 12345 % value == 0;
    }

    @Benchmark
    public boolean testNModConst() {
       return value % 12345 == 0;
   }
}

The results are below

Benchmark                  Mode  Cnt   Score   Error  Units
Tests.testConstModN          ss   10  10.789 ± 0.305   s/op
Tests.testNModConst          ss   10   7.550 ± 0.067   s/op

I am running on JDK 1.8.0_101, VM 25.101-b13, Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz (family: 0x6, model: 0x3c, stepping: 0x3)

If I set the const equal to the value or if I set the value to 0xffffffff, nothing changes.

Slonopotamus
  • 201
  • 1
  • 5
  • Try to run a CONST % number while they are equals and show me the results please :) – Erik Sep 06 '16 at 19:10
  • What happens if you set `value` to 12345 and use 1230 as your constant? – GriffeyDog Sep 06 '16 at 19:12
  • I added the additional information to the question. It does not matter what the values of `const` and `value` are. – Slonopotamus Sep 06 '16 at 19:23
  • 1
    If you're on Linux, try using `-prof perfasm` to dump the actual assembly generated for the two functions. I suspect that `testNModConst` has been optimized to replace the division with a multiplication. – Tavian Barnes Sep 06 '16 at 19:27
  • If you do your tests in the other order, does that change? Also, see the Constant Folding section in http://java-performance.info/jmh/: `value` is the same every time, and the JVM may notice this and opportunistically do a `if (value == 12345) return the_answer` as the first operation, or something like that. That doc suggests techniques to avoid this. (IDK, I haven't used JMH, I'm seeing this question because of the x86-64 tag. If you can show the actual JIT-compiled asm that runs for each version, I can tell you how many cycles it should take on your Haswell CPU.) – Peter Cordes Sep 06 '16 at 19:33
  • A related question: why can I work out (n mod 1000000) faster than I can work out (1000000 mod n)? This doesn't exactly address your question, but perhaps it shows that you might not necessarily expect an operation to still be as easy if you switch the operands. – mwfearnley Sep 06 '16 at 19:37

2 Answers2

7

The CPU's DIV and MOD instructions are very expensive, costing 50 cycles or more. When you use a variable divisor, using these instructions is unavoidable. However, when you use a constant divisor, this can be compiled into a short sequence of much cheaper instructions such as MUL, ADD, and SHR.

Hacker's delight, chapter 10 covers several algorithms that solve this problem.

Marko Topolnik
  • 195,646
  • 29
  • 319
  • 436
  • Specifically the following strength reduction is probably at play: https://gmplib.org/~tege/divcnst-pldi94.pdf – Tavian Barnes Sep 06 '16 at 19:26
  • More likely a modular multiplicative inverse (multiply by a big magic constant, and use the upper half) – Peter Cordes Sep 06 '16 at 19:28
  • This can be resolved by inspecting the actual machine code emitted from the JIT compiler. I did it once before, maybe I can find that answer... – Marko Topolnik Sep 06 '16 at 19:29
  • 2
    Here it is: http://stackoverflow.com/questions/31372565/large-performance-gap-between-cpus-div-instruction-and-hotspots-jit-code/31378410#31378410 – Marko Topolnik Sep 06 '16 at 19:32
  • 2
    My favourite treatment is http://ridiculousfish.com/blog/posts/labor-of-division-episode-iii.html – Tavian Barnes Sep 06 '16 at 19:38
  • You can see what a C compiler does with the two cases for the OP's constant of 12345 [on the Godbolt compiler explorer](https://godbolt.org/g/SNRZkR). (clang3.8.1 -O3 -mtune=haswell). gcc does essentially the same thing with the same multiplicative modular inverse, but uses the implicit-operand form of (`imul r32`) to produce the high half of the product in edx, instead of `imul r64, r64, imm32` + shift to get the high half of the product in eax. – Peter Cordes Sep 06 '16 at 19:50
1

Beware this answer it's only intuition

It's because of the nature of the % operator

In testNModConst() 1230 is less than 12345, so modulus only needs to internally check their size and realize that 12345 is bigger (one operation)

In testConstModN() 12345 is greater than 1230, so modulus will have to do a series of operations (subtraction) to calculate the answer.

It depends on the size of the integer relative to the constant

cjds
  • 269
  • 1
  • 3
  • 12