
I'm trying to determine whether a 32-bit integer is even or odd. I have set up two approaches:

modulo(%) approach

int r = (i % 2);

bitwise(&) approach

int r = (i & 0x1);
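One subtlety worth noting (a quick illustrative sketch, not part of the original question): for negative odd numbers the two expressions produce different raw values in Java (`i % 2` yields -1 while `i & 0x1` yields 1), although both are non-zero, so an evenness test against 0 agrees either way:

```java
public class EvenOddCheck {
    public static void main(String[] args) {
        int i = -7;
        System.out.println(i % 2);   // -1: the remainder keeps the dividend's sign
        System.out.println(i & 0x1); //  1: masking looks only at the low bit
        // Both are non-zero, so "is i even?" agrees for all 32-bit ints:
        System.out.println((i % 2 == 0) == ((i & 0x1) == 0)); // true
    }
}
```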

Both approaches work correctly. So I ran each line 15,000 times to test performance.

Result:

modulo(%) approach (source code)

mean 141.5801887ns | SD 270.0700275ns

bitwise(&) approach (source code)

mean 141.2504ns | SD 193.6351007ns

Questions:

Why is the bitwise (&) approach more stable than the modulo (%) approach?

Does the JVM optimize modulo (%) using AND (&), as suggested here?

Lau Ka Fai
  • Your two benchmarks look almost identical to me, minus a slightly different standard deviation. But there is a chance that you did not even set up both tests in a statistically meaningful way. – Tim Biegeleisen Jul 08 '18 at 14:13
  • The first link should probably be https://github.com/kflau/modulo-division-vs-bitwise-benchmark/blob/master/src/test/java/com/aei/bit/modulo/benchmark/ModulusTest.java instead. – luk2302 Jul 08 '18 at 14:14
  • 4
    Relevant anyway: [How do I write a correct micro-benchmark in Java?](https://stackoverflow.com/questions/504103/how-do-i-write-a-correct-micro-benchmark-in-java) – luk2302 Jul 08 '18 at 14:14
  • 2
    You're doing it completely wrong. You can't use nanoTime() to measure a single operation execution, since it usually has granularity of hunder of nanoseconds or even more (depending on your OS). Use JMH and check an assembly code. – Andrey Cheboksarov Jul 08 '18 at 14:16

3 Answers


Let's try to reproduce with JMH.

@State(Scope.Benchmark)
public class MyBenchmark {

    int i = 11; // the value whose parity is tested

    @Benchmark
    @OutputTimeUnit(TimeUnit.NANOSECONDS)
    @BenchmarkMode(Mode.AverageTime)
    public int first() {
        return i % 2;
    }

    @Benchmark
    @OutputTimeUnit(TimeUnit.NANOSECONDS)
    @BenchmarkMode(Mode.AverageTime)
    public int second() {
        return i & 0x1;
    }
}

Okay, it is reproducible. The first is slightly slower than the second. Now let's figure out why. Run it with -prof perfnorm:

Benchmark                                 Mode  Cnt   Score    Error  Units
MyBenchmark.first                         avgt   50   2.674 ±  0.028  ns/op
MyBenchmark.first:CPI                     avgt   10   0.301 ±  0.002   #/op
MyBenchmark.first:L1-dcache-load-misses   avgt   10   0.001 ±  0.001   #/op
MyBenchmark.first:L1-dcache-loads         avgt   10  11.011 ±  0.146   #/op
MyBenchmark.first:L1-dcache-stores        avgt   10   3.011 ±  0.034   #/op
MyBenchmark.first:L1-icache-load-misses   avgt   10  ≈ 10⁻³            #/op
MyBenchmark.first:LLC-load-misses         avgt   10  ≈ 10⁻⁴            #/op
MyBenchmark.first:LLC-loads               avgt   10  ≈ 10⁻⁴            #/op
MyBenchmark.first:LLC-store-misses        avgt   10  ≈ 10⁻⁵            #/op
MyBenchmark.first:LLC-stores              avgt   10  ≈ 10⁻⁴            #/op
MyBenchmark.first:branch-misses           avgt   10  ≈ 10⁻⁴            #/op
MyBenchmark.first:branches                avgt   10   4.006 ±  0.054   #/op
MyBenchmark.first:cycles                  avgt   10   9.322 ±  0.113   #/op
MyBenchmark.first:dTLB-load-misses        avgt   10  ≈ 10⁻⁴            #/op
MyBenchmark.first:dTLB-loads              avgt   10  10.939 ±  0.175   #/op
MyBenchmark.first:dTLB-store-misses       avgt   10  ≈ 10⁻⁵            #/op
MyBenchmark.first:dTLB-stores             avgt   10   2.991 ±  0.045   #/op
MyBenchmark.first:iTLB-load-misses        avgt   10  ≈ 10⁻⁵            #/op
MyBenchmark.first:iTLB-loads              avgt   10  ≈ 10⁻⁴            #/op
MyBenchmark.first:instructions            avgt   10  30.991 ±  0.427   #/op
MyBenchmark.second                        avgt   50   2.263 ±  0.015  ns/op
MyBenchmark.second:CPI                    avgt   10   0.320 ±  0.001   #/op
MyBenchmark.second:L1-dcache-load-misses  avgt   10   0.001 ±  0.001   #/op
MyBenchmark.second:L1-dcache-loads        avgt   10  11.045 ±  0.152   #/op
MyBenchmark.second:L1-dcache-stores       avgt   10   3.014 ±  0.032   #/op
MyBenchmark.second:L1-icache-load-misses  avgt   10  ≈ 10⁻³            #/op
MyBenchmark.second:LLC-load-misses        avgt   10  ≈ 10⁻⁴            #/op
MyBenchmark.second:LLC-loads              avgt   10  ≈ 10⁻⁴            #/op
MyBenchmark.second:LLC-store-misses       avgt   10  ≈ 10⁻⁵            #/op
MyBenchmark.second:LLC-stores             avgt   10  ≈ 10⁻⁴            #/op
MyBenchmark.second:branch-misses          avgt   10  ≈ 10⁻⁴            #/op
MyBenchmark.second:branches               avgt   10   4.014 ±  0.045   #/op
MyBenchmark.second:cycles                 avgt   10   8.024 ±  0.098   #/op
MyBenchmark.second:dTLB-load-misses       avgt   10  ≈ 10⁻⁵            #/op
MyBenchmark.second:dTLB-loads             avgt   10  10.989 ±  0.161   #/op
MyBenchmark.second:dTLB-store-misses      avgt   10  ≈ 10⁻⁶            #/op
MyBenchmark.second:dTLB-stores            avgt   10   3.004 ±  0.042   #/op
MyBenchmark.second:iTLB-load-misses       avgt   10  ≈ 10⁻⁵            #/op
MyBenchmark.second:iTLB-loads             avgt   10  ≈ 10⁻⁵            #/op
MyBenchmark.second:instructions           avgt   10  25.076 ±  0.296   #/op

Note the difference in cycles and instructions. Now the cause is fairly obvious: the first has to care about the sign of the operand, while the second is just a bitwise AND. To make sure this is the reason, take a look at the assembly fragments:

first:

0x00007f91111f8355: mov     0xc(%r10),%r11d   ;*getfield i
0x00007f91111f8359: mov     %r11d,%edx
0x00007f91111f835c: and     $0x1,%edx
0x00007f91111f835f: mov     %edx,%r10d
0x00007f6bd120a6e2: neg     %r10d
0x00007f6bd120a6e5: test    %r11d,%r11d
0x00007f6bd120a6e8: cmovl   %r10d,%edx       

second:

0x00007ff36cbda580: mov     $0x1,%edx
0x00007ff36cbda585: mov     0x40(%rsp),%r10
0x00007ff36cbda58a: and     0xc(%r10),%edx  
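The assembly for first corresponds, roughly, to the branch-free pattern below (a hand-written Java sketch of what the JIT emits, not code from the answer): mask the low bit, then negate the result when the operand is negative.

```java
public class Mod2Sketch {
    // Rough Java equivalent of the JIT-compiled `i % 2` above:
    // and $0x1      -> i & 1
    // neg/test/cmovl -> conditional negation based on the sign of i
    static int mod2(int i) {
        int r = i & 1;
        return i < 0 ? -r : r;
    }

    public static void main(String[] args) {
        // Agrees with % for positive and negative operands alike:
        System.out.println(mod2(5) == 5 % 2);   // true
        System.out.println(mod2(-5) == -5 % 2); // true
        System.out.println(mod2(-4) == -4 % 2); // true
    }
}
```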
St.Antario
  • 4
    I'm missing here an explanation like "modulus is a terribly slow operation, but the JVM optimizes `i % 2` as `i > 0 ? i & 1 : -(i & 1)`. – maaartinus Jul 18 '18 at 11:23
  • 1
    @maaartinus This is true Iff `i` is a power of 2. Otherwise it compiles to some heavier bit-manipulations (surprisingly, not simple `idiv` as I expected). – St.Antario Jul 18 '18 at 13:01

An execution time of 150 ns is about 500 clock cycles. I don't think there has ever been a processor that went about the business of checking a bit this inefficiently :-).

The problem is that your test harness is flawed in many ways. In particular:

  • you make no attempt to trigger JIT compilation before starting the clock
  • System.nanoTime() is not guaranteed to have nanosecond accuracy
  • System.nanoTime() is quite a bit more expensive to call than the code you want to measure
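The granularity and call-cost points are easy to observe directly (a minimal sketch; the numbers you see depend entirely on your OS and JVM):

```java
public class NanoTimeGranularity {
    public static void main(String[] args) {
        // Spin until nanoTime() reports a new value; the delta approximates
        // the smallest tick the clock can actually resolve on this system.
        long a = System.nanoTime();
        long b = System.nanoTime();
        while (b == a) {
            b = System.nanoTime();
        }
        System.out.println("smallest observed tick: " + (b - a) + " ns");
    }
}
```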

See How do I write a correct micro-benchmark in Java? for a more complete list of things to keep in mind.

Here's a better benchmark:

public abstract class Benchmark {

    final String name;

    public Benchmark(String name) {
        this.name = name;
    }

    @Override
    public String toString() {
        return name + "\t" + time() + " ns / iteration";
    }

    private BigDecimal time() {
        try {
            // automatically detect a reasonable iteration count (and trigger just in time compilation of the code under test)
            int iterations;
            long duration = 0;
            for (iterations = 1; iterations < 1_000_000_000 && duration < 1_000_000_000; iterations *= 2) {
                long start = System.nanoTime();
                run(iterations);
                duration = System.nanoTime() - start;
                cleanup();
            }
            return new BigDecimal((duration) * 1000 / iterations).movePointLeft(3);
        } catch (Throwable e) {
            throw new RuntimeException(e);
        }
    }

    /**
     * Executes the code under test.
     * @param iterations
     *            number of iterations to perform
     * @return any value that requires the entire code to be executed (to
     *         prevent dead code elimination by the just in time compiler)
     * @throws Throwable
     *             if the test could not complete successfully
     */
    protected abstract Object run(int iterations) throws Throwable;

    /**
     * Cleans up after a run, setting the stage for the next.
     */
    protected void cleanup() {
        // do nothing
    }

    public static void main(String[] args) throws Exception {
        System.out.println(new Benchmark("%") {
            @Override
            protected Object run(int iterations) throws Throwable {
                int sum = 0;
                for (int i = 0; i < iterations; i++) {
                    sum += i % 2;
                }
                return sum; 
            }
        });
        System.out.println(new Benchmark("&") {
            @Override
            protected Object run(int iterations) throws Throwable {
                int sum = 0;
                for (int i = 0; i < iterations; i++) {
                    sum += i & 1;
                }
                return sum;
            }
        });
    }
}

On my machine, it prints:

%   0.375 ns / iteration
&   0.139 ns / iteration

So the difference is, as expected, on the order of a couple clock cycles. That is, & 1 was optimized slightly better by this JIT on this particular hardware, but the difference is so small it is extremely unlikely to have a measurable (let alone significant) impact on the performance of your program.

meriton

The two operations compile to different JVM bytecode instructions:

irem     // int remainder (%)
iand     // bitwise and (&)

Somewhere I read that irem is usually implemented in software by the JVM, while iand maps directly to a hardware instruction. Oracle explains the two instructions as follows:

iand

An int result is calculated by taking the bitwise AND (conjunction) of value1 and value2.

irem

The int result is value1 - (value1 / value2) * value2.
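That definition can be checked directly against Java's % operator (a quick sketch using nothing beyond the spec formula quoted above); note that it also explains why the result keeps the dividend's sign:

```java
public class IremSpec {
    // The JVM spec's definition of irem, written out in Java:
    static int iremSpec(int value1, int value2) {
        return value1 - (value1 / value2) * value2;
    }

    public static void main(String[] args) {
        // Matches % for positive and negative dividends alike,
        // because integer division truncates toward zero:
        System.out.println(iremSpec(7, 2));            // 1
        System.out.println(iremSpec(-7, 2));           // -1
        System.out.println(iremSpec(-7, 2) == -7 % 2); // true
    }
}
```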

It seems reasonable to assume that iand takes fewer CPU cycles.

steffen
  • Most CPUs can compute remainder, too, but it's slow and the JVM [optimizes it to AND if possible](https://stackoverflow.com/questions/51232809/performance-comparison-of-modulo-operator-and-bitwise-and#comment89772081_51233271). – maaartinus Jul 18 '18 at 11:26
  • @maaartinus When does this optimization take place? – steffen Jul 18 '18 at 15:52
  • 2
    Practically, all optimizations take place at runtime. `javac` (or `ecj` or whatever) doesn't care. The JVM does. Many optimizations lead to a code bloat and it's much better to concentrate on the hot spots (and so they call [their compiler](https://en.wikipedia.org/wiki/HotSpot)). Moreover, many optimizations are based on some assumptions, which may get invalidated later and the JVM must [deoptimize](https://stackoverflow.com/a/20542675/581205). Therefore you see nothing like that in the bytecode. – maaartinus Jul 18 '18 at 16:58