
I keep reading that bit-shifting is not needed, because compiler optimizations will translate a multiplication into a bit shift, for example in "Should I bit-shift to divide by 2 in Java?" and "Is shifting bits faster than multiplying and dividing in Java? .NET?".

I am not asking about the performance difference here; I could test that myself. What I find curious is that several people claim the two versions will be "compiled to the same thing", which seems not to be true. I have written a small piece of code:

private static void multi()
{
    int a = 3;
    int b = a * 2;
    System.out.println(b);
}

private static void shift()
{
    int a = 3;
    int b = a << 1L;
    System.out.println(b);
}

Both methods give the same result and just print it out.
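(To look at the bytecode, I used the javap disassembler that ships with the JDK; the -p flag is needed because the methods are private. Main is just a placeholder for the actual class name.)

javac Main.java
javap -c -p Main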

When I look at the generated Java bytecode, the following is shown:

private static void multi();
Code:
   0: iconst_3
   1: istore_0
   2: iload_0
   3: iconst_2
   4: imul
   5: istore_1
   6: getstatic     #4                  // Field java/lang/System.out:Ljava/io/PrintStream;
   9: iload_1
  10: invokevirtual #5                  // Method java/io/PrintStream.println:(I)V
  13: return

private static void shift();
Code:
   0: iconst_3
   1: istore_0
   2: iload_0
   3: iconst_1
   4: ishl
   5: istore_1
   6: getstatic     #4                  // Field java/lang/System.out:Ljava/io/PrintStream;
   9: iload_1
  10: invokevirtual #5                  // Method java/io/PrintStream.println:(I)V
  13: return

Here we can see the difference: the multiplication compiles to "imul", while the shift compiles to "ishl".

My question: clearly the optimization that people speak of is not visible in the Java bytecode. I am still assuming the optimization does happen, so does it simply happen at a lower level? Or, alternatively, because this is Java, does the JVM, when encountering the imul instruction, somehow know that it should be translated to something else? If so, any resources on how this is handled would be greatly appreciated.

(As a side note, I am not trying to justify the use of bit-shifting. I think it decreases readability, at least for people used to Java; for C++ that might be different. I am just trying to see where the optimization happens.)

Dylan Meeus
  • The bytecode will be converted into machine code for execution by the JIT; that conversion will optimize the differences away (and the two methods will perform identically). – Elliott Frisch Jun 27 '16 at 09:50
  • The "optimization" might happen at the JVM level since it's written for the target arch in c/c++; but as you spotted - it's not quite the same. Sintactically, shifting left is the same as multiplying by two, but semantically it's two different instructions. The machine code interpreter might convert them to the same thing but it's not guaranteed and certainly not very visible without diving into the JVM code. – Shark Jun 27 '16 at 09:53

1 Answer


The question in the title sounds a bit different from the question in the text. The quoted statement that the shift and the multiplication are "compiled to the same thing" is true. But it does not yet apply to the bytecode.

In general, Java bytecode is rather unoptimized. Very few optimizations are done at all, mainly the inlining of constants. Beyond that, the bytecode is just an intermediate representation of the original program, and the translation from Java to bytecode is done rather "literally".

(Which, I think, is a good thing: the bytecode still closely resembles the original Java code, and all the nitty-gritty (platform-specific!) optimizations that may be possible are left to the virtual machine, which has far more options here.)
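As a small illustration of the one optimization that javac does perform: constant expressions are folded at compile time. (The class and method names here are made up for the example.)

class ConstantFoldingExample
{
    static int folded()
    {
        // javac evaluates the constant expression 6 * 7 at compile time,
        // so the bytecode is just "bipush 42" followed by "ireturn" -
        // no imul instruction appears here.
        return 6 * 7;
    }
}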

All further optimizations, such as arithmetic simplifications, dead code elimination or method inlining, are done by the JIT (Just-In-Time compiler) at runtime. The JIT also applies the optimization in question here: replacing the multiplication with a bit shift.

The example that you gave makes it a bit difficult to show the effect, for several reasons. The fact that System.out.println is called in the method tends to make the actual machine code large, due to inlining and the general prerequisites for calling this method. But more importantly, a shift by 1, which corresponds to a multiplication by 2, also corresponds to adding the value to itself. So instead of observing a shl (left-shift) assembly instruction in the resulting machine code for the multi method, you would likely see a disguised add instruction in both the multi and the shift method.
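To illustrate the equivalence: for any int a, the expressions a * 2, a << 1 and a + a produce exactly the same result, even when the doubling overflows, so the JIT is free to pick whichever instruction is cheapest on the target CPU. A tiny sketch (the class name is made up):

class DoublingEquivalence
{
    public static void main(String[] args)
    {
        int a = 0x40000001; // a value whose doubling overflows

        // All three print -2147483646, because int arithmetic
        // wraps around modulo 2^32 in each case:
        System.out.println(a * 2);
        System.out.println(a << 1);
        System.out.println(a + a);
    }
}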

However, here is a very pragmatic example that does a left shift by 8, corresponding to a multiplication by 256:

class BitShiftOptimization
{
    public static void main(String[] args)
    {
        int blackHole = 0;
        for (int i = 0; i < 1000000; i++)
        {
            blackHole += testMulti(i);
            blackHole += testShift(i);
        }
        System.out.println(blackHole);
    }

    public static int testMulti(int a)
    {
        int b = a * 256;
        return b;
    }

    public static int testShift(int a)
    {
        int b = a << 8L;
        return b;
    }
}

(It receives the value to be shifted as an argument, to prevent it from being optimized away into a constant. It calls the methods many times, to trigger the JIT. And it returns and collects the values from both methods, to prevent the method calls from being optimized away. Again, this is very pragmatic, but sufficient to show the effect.)
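If you only want to confirm that the JIT compiles these methods at all, without reading any assembly, the -XX:+PrintCompilation flag is sufficient (the exact output format varies between JVM versions):

java -XX:+PrintCompilation BitShiftOptimization

Lines mentioning BitShiftOptimization::testMulti and BitShiftOptimization::testShift indicate that these methods were compiled to native code.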

Running this in a HotSpot VM with the disassembler plugin enabled, using

java -server -XX:+UnlockDiagnosticVMOptions -XX:+TraceClassLoading -XX:+LogCompilation -XX:+PrintInlining -XX:+PrintAssembly BitShiftOptimization

will produce the following assembler code for the testMulti method:

Decoding compiled method 0x000000000286fbd0:
Code:
[Entry Point]
[Verified Entry Point]
[Constants]
  # {method} {0x000000001c0003b0} 'testMulti' '(I)I' in 'BitShiftOptimization'
  # parm0:    rdx       = int
  #           [sp+0x40]  (sp of caller)
  0x000000000286fd20: mov    %eax,-0x6000(%rsp)
  0x000000000286fd27: push   %rbp
  0x000000000286fd28: sub    $0x30,%rsp
  0x000000000286fd2c: movabs $0x1c0005a8,%rax   ;   {metadata(method data for {method} {0x000000001c0003b0} 'testMulti' '(I)I' in 'BitShiftOptimization')}
  0x000000000286fd36: mov    0xdc(%rax),%esi
  0x000000000286fd3c: add    $0x8,%esi
  0x000000000286fd3f: mov    %esi,0xdc(%rax)
  0x000000000286fd45: movabs $0x1c0003a8,%rax   ;   {metadata({method} {0x000000001c0003b0} 'testMulti' '(I)I' in 'BitShiftOptimization')}
  0x000000000286fd4f: and    $0x1ff8,%esi
  0x000000000286fd55: cmp    $0x0,%esi
  0x000000000286fd58: je     0x000000000286fd70  ;*iload_0
                        ; - BitShiftOptimization::testMulti@0 (line 17)

  0x000000000286fd5e: shl    $0x8,%edx
  0x000000000286fd61: mov    %rdx,%rax
  0x000000000286fd64: add    $0x30,%rsp
  0x000000000286fd68: pop    %rbp
  0x000000000286fd69: test   %eax,-0x273fc6f(%rip)        # 0x0000000000130100
                        ;   {poll_return}
  0x000000000286fd6f: retq   
  0x000000000286fd70: mov    %rax,0x8(%rsp)
  0x000000000286fd75: movq   $0xffffffffffffffff,(%rsp)
  0x000000000286fd7d: callq  0x000000000285f160  ; OopMap{off=98}
                        ;*synchronization entry
                        ; - BitShiftOptimization::testMulti@-1 (line 17)
                        ;   {runtime_call}
  0x000000000286fd82: jmp    0x000000000286fd5e
  0x000000000286fd84: nop
  0x000000000286fd85: nop
  0x000000000286fd86: mov    0x2a8(%r15),%rax
  0x000000000286fd8d: movabs $0x0,%r10
  0x000000000286fd97: mov    %r10,0x2a8(%r15)
  0x000000000286fd9e: movabs $0x0,%r10
  0x000000000286fda8: mov    %r10,0x2b0(%r15)
  0x000000000286fdaf: add    $0x30,%rsp
  0x000000000286fdb3: pop    %rbp
  0x000000000286fdb4: jmpq   0x0000000002859420  ;   {runtime_call}
  0x000000000286fdb9: hlt    
  0x000000000286fdba: hlt    
  0x000000000286fdbb: hlt    
  0x000000000286fdbc: hlt    
  0x000000000286fdbd: hlt    
  0x000000000286fdbe: hlt    

(The code for the testShift method contains the same instructions, by the way.)

The relevant line here is

  0x000000000286fd5e: shl    $0x8,%edx

which corresponds to the left-shift by 8.
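In case there is any doubt that this replacement is valid for every input, a brute-force check over the entire int range confirms it (this takes a few seconds and is just a sanity check, not part of the setup above):

class ExhaustiveCheck
{
    public static void main(String[] args)
    {
        // Both expressions wrap around modulo 2^32, so they agree
        // for every int value, including the overflowing ones.
        for (int i = Integer.MIN_VALUE; ; i++)
        {
            if (i * 256 != (i << 8))
            {
                throw new AssertionError("Mismatch at " + i);
            }
            if (i == Integer.MAX_VALUE)
            {
                break;
            }
        }
        System.out.println("a * 256 == (a << 8) for every int a");
    }
}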

Marco13
  • Thank you Marco, that's a beautiful answer and just what I wanted to know. May I ask you which disassembler VM you have used for this? I'd be more than interested to play around with this myself. :-) – Dylan Meeus Jun 27 '16 at 11:45
  • Note that leaving the bytecode unoptimized also makes debugging a lot easier. – Antimony Jun 27 '16 at 13:52
  • @DylanMeeus In order to use this disassembling functionality, you only need to throw the `hsdis-amd64.dll` (or `.so` or `.dylib` for Linux or Mac) into the `jre/bin/server` directory. Compiling this on your own is a bit fiddly, but you can find precompiled versions on the web (I took the one from [this site](http://lafo.ssw.uni-linz.ac.at/hsdis/att/), but no warranty...) – Marco13 Jun 27 '16 at 14:28
  • @Antimony Likely true, but the exception handling (and how erroneous conditions are detected and find their way from the optimized machine code into the JVM) still involves lots of black magic. I can identify some parts of the assembler code as "related to this", but am far from understanding what's going on there internally. – Marco13 Jun 27 '16 at 14:30
  • I think the platform-specific nature of the optimization should be emphasized, rather than printed in small letters. If Java (source to bytecode) compilers ever do optimizations, they strive for compacting the code, as that's the only universal optimization, saving disk space and loading time. What is more efficient for the target CPU is unpredictable at compile time. For the question's code, replacing the entire arithmetic with the constant result is much more likely. It could even happen with this answer's code, though it would require an aggressively optimizing JVM and a longer running time… – Holger Jun 27 '16 at 15:10