I have been reading Effective Java, 3/E.
While reading the part about hashCode (page 51), I noticed the book saying:
A nice property of 31 is that the multiplication can be replaced by a shift and a subtraction for better performance on some architectures: 31 * i == (i << 5) - i. Modern VMs do this sort of optimization automatically.
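The identity itself is easy to convince yourself of, since 31 * i is just 32 * i - i. Here is a quick throwaway sanity check I ran (nothing beyond the standard library, and the equality holds even when the multiplication overflows, because Int arithmetic wraps around modulo 2^32):

import kotlin.random.Random

fun main() {
    repeat(1_000) {
        val i = Random.nextInt()
        // 31 * i == 32 * i - i == (i shl 5) - i
        check(31 * i == (i shl 5) - i)
    }
    println("identity holds for all sampled values")
}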
That made sense to me, and I wondered how much faster the code would become once this kind of optimization took place. So I wrote a short piece of code to measure the impact.
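Roughly, the timing code looked like this (a simplified sketch of what I tried, not my exact code; a proper micro-benchmark would use something like JMH instead of manual timing):

import kotlin.random.Random
import kotlin.system.measureNanoTime

fun main() {
    val nums = IntArray(10_000_000) { Random.nextInt() }
    var sink = 0
    // Multiplication by 31, the case the book says can become a shift and a subtraction.
    val t31 = measureNanoTime { for (n in nums) sink += n * 31 }
    // Multiplication by 30 as a baseline with no such shift trick.
    val t30 = measureNanoTime { for (n in nums) sink += n * 30 }
    println("*31: $t31 ns, *30: $t30 ns, sink=$sink")
}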
But there were no noticeable differences, so I wrote a much simpler program to check whether the optimization happens at all.
Below is my sample code.
import kotlin.random.Random

fun main() {
    val num = Random.nextInt()
    val a = num * 30
    val b = num * 31
    val c = num * 32
    println("$a, $b, $c")
}
And this is the compiled bytecode, obtained from IntelliJ's Kotlin bytecode viewer.
L1
LINENUMBER 5 L1
ILOAD 0
BIPUSH 30
IMUL
ISTORE 1
L2
LINENUMBER 6 L2
ILOAD 0
BIPUSH 31
IMUL
ISTORE 2
L3
LINENUMBER 7 L3
ILOAD 0
BIPUSH 32
IMUL
ISTORE 3
Apparently, there is no difference: each constant is pushed and IMUL is called in all three cases. I thought maybe the optimization takes place when the Java bytecode is compiled into actual machine code, but I have never looked at that side, so I do not know how to confirm it. I searched around, and it seems the keyword I am looking for is the JIT compiler, which converts the .class files into CPU-specific machine code.
I thought I could try converting this code to CPU-specific machine code through the JIT compiler, but then I would only have checked the theory on one specific CPU, not all of them. I want to know whether it is "generally true", and checking every CPU would take too much time.
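One thing I considered trying (just a sketch, I have not verified this setup myself) is making a tiny method hot enough for the JIT to compile it, and then dumping what HotSpot produces with its diagnostic flags:

fun times31(i: Int): Int = i * 31

fun main() {
    var acc = 0
    // Call the method enough times that the JIT considers it hot and compiles it.
    repeat(10_000_000) { acc += times31(it) }
    println(acc)
}

// Run the compiled program with the JVM flags
//   -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly
// to see the generated machine code (from what I read, this needs the hsdis
// disassembler library to be installed), or with
//   -XX:+PrintCompilation
// to only see which methods get JIT-compiled, without the machine code.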
So, is there any way to confirm that the code above is actually (generally) optimized this way? And if I have similar questions in the future, where should I look? When I am curious about Java behavior, I go to Oracle and check the JVM specification or the Java SE documentation. But what about compiler behavior? Where should I start?
That was a long question. Thank you for spending your precious time reading it.
(Just an additional note)
I checked C and Python on https://godbolt.org/, and confirmed that for C the multiplication is actually optimized.
#include <stdlib.h>

int test(int num) {
    int n = rand();
    int a = n * 30;
    int b = n * 31;
    int c = n * 32;
    return a * b * c;
}
test:
push rbp
mov rbp, rsp
sub rsp, 32
mov DWORD PTR [rbp-20], edi
call rand
mov DWORD PTR [rbp-4], eax
mov eax, DWORD PTR [rbp-4]
imul eax, eax, 30
mov DWORD PTR [rbp-8], eax
mov edx, DWORD PTR [rbp-4]
mov eax, edx
sal eax, 5
sub eax, edx
mov DWORD PTR [rbp-12], eax
mov eax, DWORD PTR [rbp-4]
sal eax, 5
mov DWORD PTR [rbp-16], eax
mov eax, DWORD PTR [rbp-8]
imul eax, DWORD PTR [rbp-12]
imul eax, DWORD PTR [rbp-16]
leave
ret
However, the Python version was not.
num = randint()
a = num * 30
b = num * 31
c = num * 32
5 18 LOAD_NAME 2 (num)
20 LOAD_CONST 2 (30)
22 BINARY_MULTIPLY
24 STORE_NAME 3 (a)
6 26 LOAD_NAME 2 (num)
28 LOAD_CONST 3 (31)
30 BINARY_MULTIPLY
32 STORE_NAME 4 (b)
7 34 LOAD_NAME 2 (num)
36 LOAD_CONST 4 (32)
38 BINARY_MULTIPLY
40 STORE_NAME 5 (c)
42 LOAD_CONST 5 (None)
44 RETURN_VALUE