3

I have tried to test if using var & 3 is faster than var % 4 in java (it could also be & 2^n - 1 vs. % 2^n). I have made a simple program to calculate the average time it takes to do the calculations, but I get strange results and I can't conclude. For about 1000 calculations, the average is that mod 4 takes much more time, but when I try with about 1000000 calculations, both averages are about the same... I suspect this is due to java optimization of my code, but I am not sure.

Which of those two operations is supposed to be faster, and how is % implemented?

Thanks!

EDIT: Here is my test program.

    long startTime, time, sum;
    int iterations = 1000;
    int v;

    sum = 0;
    for(int i = 0; i < iterations; i++)
    {
        startTime = System.nanoTime();
        v = i % 4;
        time = System.nanoTime();
        sum += time-startTime;
    }
    System.out.println("Mod 4 : "+(sum/iterations));

    sum = 0;
    for(int i = 0; i < iterations; i++)
    {
        startTime = System.nanoTime();
        v = i & 3;
        time = System.nanoTime();
        sum += time-startTime;
    }
    System.out.println("& 3 : "+(sum/iterations));

With 100 iterations, I get 130 nanosec for mod 4 and 25060 nanosec for & 3.

For 1000 iterations, I get 1792 nanosec for mod 4 and 81 nanosec for & 3.

With 1000000 iterations, I get about 50 nanosec for both, while having mod 4 always a few nanosec longer.

  • 2
    It depends on how you perform the tests. I suggest testing negative number as well. – Peter Lawrey Jan 26 '14 at 22:18
  • 2
    I'm sure there are experts on this but Java Hotspot VM sometimes choose to interpret some portions of the code instead of compiling it to native code (because compiling might take more time than just simply interpreting it). This might be the case. Use `-Xint` when running your code to see whether it's because Java compiles it to native code in your second example. – regulus Jan 26 '14 at 22:23
  • Regulus has a point. However, if results tend to stabilize when you hit the millions of tests (so you have a significant amount of statistical data), then you should consider that you just cannot certainly say that one option is faster than the other one - it may be just a draw. – Jorge_B Jan 26 '14 at 22:25
  • On another note, there is a slight discrepancy in results for negative numbers, -5 % 3 = -2 whereas -5 & 3 = 3... – Sinkingpoint Jan 26 '14 at 22:29
  • It might be useful to post your test and your times. – Radiodef Jan 26 '14 at 22:41
  • You should use a proper benchmarking framework such as jmh. – assylias Jan 26 '14 at 22:58
  • See also http://stackoverflow.com/questions/504103/how-do-i-write-a-correct-micro-benchmark-in-java – assylias Jan 26 '14 at 23:04
  • Thanks for the link assylias, I'll look into that. –  Jan 26 '14 at 23:06
  • Check if it changes if you replace order of tests in your program. Assignments to `v` are probably optimized to no-op when JIT compiler activates, so all you are measuring then `System.nanoTime`. – zch Jan 27 '14 at 00:46
  • My results are reversed when I change the order of the tests, so you're probably right. –  Jan 27 '14 at 01:31
  • @Quirliom: This discrepancy gets solved via a test, which itself costs some time, but it's still faster than a division. In my benchmark below you can the branch misprediction penalty when negative numbers are involved. – maaartinus Jan 27 '14 at 06:58
  • don't know about Java compiler but most other compilers will transfer division/modulus by constant into multiplications (or and/shift for powers of 2s). You need to check the output assembly/bytecode – phuclv Jan 27 '14 at 07:03
  • Note that `System.nanoTime()` takes much much longer than the operation you're measuring. A simple operation mustn't take more than 1 ns on a non-museum-grade computer. – maaartinus Jan 27 '14 at 15:38

2 Answers2

4

Java, or any compiler for that matter, may optimize this statically or at runtime (for ones with JIT-ing capabilities), so it's hard to tell what your code really does, but if you inspect the machine code being performed eventually on any host machine, it's almost guaranteed that doing an AND operation would be significantly faster in terms of latency (and probably also in throughput) than modulo. The first requires very simple ALU unit that usually exists in abundance on most CPU cores, while the modulo would probably have to be performed over a divider unit that is both slower and more scarce (i.e. - exists on less execution ports).

However, there are too many layers between your java code and the actual bare metal CPU to be able to give concrete answer, you should either switch to a lower level of benchmarking (c or assembly), or take into account the other factors and observe the bytecode and on-the-fly changes made by the compiler.

Leeor
  • 19,260
  • 5
  • 56
  • 87
0

I was curious and wrote my own caliper benchmark. The results are per 1000 elements. Note that there's some non-negligible overhead and the speed ratio for the operations itself is much bigger.

maaartinus
  • 44,714
  • 32
  • 161
  • 320