Looking at the ICC 17 generated code for iterating over a std::unordered_map<> (using https://godbolt.org) left me very confused.
I distilled down the example to this:
long count(void** x)
{
long i = 0;
while (*x)
{
++i;
x = (void**)*x;
}
return i;
}
Compiling this with ICC 17, with the -O3 flag, leads to the following disassembly:
count(void**):
xor eax, eax #6.10
mov rcx, QWORD PTR [rdi] #7.11
test rcx, rcx #7.11
je ..B1.6 # Prob 1% #7.11
mov rdx, rax #7.3
..B1.3: # Preds ..B1.4 ..B1.2
inc rdx #7.3
mov rcx, QWORD PTR [rcx] #7.11
lea rsi, QWORD PTR [rdx+rdx] #9.7
lea rax, QWORD PTR [-1+rdx*2] #9.7
test rcx, rcx #7.11
je ..B1.6 # Prob 18% #7.11
mov rcx, QWORD PTR [rcx] #7.11
mov rax, rsi #9.7
test rcx, rcx #7.11
jne ..B1.3 # Prob 82% #7.11
..B1.6: # Preds ..B1.3 ..B1.4 ..B1.1
ret #12.10
Compared to the obvious implementation (which gcc and clang use, even for -O3), it seems to do a few things differently:
- It unrolls the loop, with two decrements before looping back - however, there is a conditional jump in the middle of it all.
- It uses lea for some of the arithmetic
- It keeps a counter (inc rdx) for every two iterations of the while loop, and immediately computes the corresponding counters for every iteration (into rax and rsi)
What are the potential benefits to doing all this? I assume it may have something to do with scheduling?
Just for comparison, this is the code generated by gcc 6.2:
count(void**):
mov rdx, QWORD PTR [rdi]
xor eax, eax
test rdx, rdx
je .L4
.L3:
mov rdx, QWORD PTR [rdx]
add rax, 1
test rdx, rdx
jne .L3
rep ret
.L4:
rep ret