What are these "abundant" assembly codes used for and why are they compiled out?

Question

I'm not familiar with the implementation of C++ compilers and I'm writing a C++ snippet like this (for learning):

#include <vector>

void vector8_inc(std::vector<unsigned char>& v) {
    for (std::size_t i = 0; i < v.size(); i++) {
        v[i]++;
    }
}

void vector32_inc(std::vector<unsigned int>& v) {
    for (std::size_t i = 0; i < v.size(); i++) {
        v[i]++;
    }
}

int main() {
    std::vector<unsigned char> my{ 10,11,2,3 };
    vector8_inc(my);
    std::vector<unsigned int> my2{ 10,11,2,3 };
    vector32_inc(my2);
}

By generating its assembly codes and disassemble its binary, I found the segments for vector8_inc and vector32_inc:

g++ -S test.cpp -o test.a -O2

_Z11vector8_incRSt6vectorIhSaIhEE:
.LFB853:
    .cfi_startproc
    endbr64
    movq    (%rdi), %rdx
    cmpq    8(%rdi), %rdx
    je  .L1
    xorl    %eax, %eax
    .p2align 4,,10
    .p2align 3
.L3:
    addb    $1, (%rdx,%rax)
    movq    (%rdi), %rdx
    addq    $1, %rax
    movq    8(%rdi), %rcx
    subq    %rdx, %rcx // line C: calculate size of the vector
    cmpq    %rcx, %rax
    jb  .L3
.L1:
    ret
    .cfi_endproc
......

_Z12vector32_incRSt6vectorIjSaIjEE:
.LFB854:
    .cfi_startproc
    endbr64
    movq    (%rdi), %rax   // line A
    movq    8(%rdi), %rdx
    subq    %rax, %rdx // line C': calculate size of the vector
    movq    %rdx, %rcx
    shrq    $2, %rcx   // Line B: ← What is this used for?
    je  .L6
    addq    %rax, %rdx
    .p2align 4,,10
    .p2align 3
.L8:
    addl    $1, (%rax)
    addq    $4, %rax
    cmpq    %rax, %rdx
    jne .L8
.L6:
    ret
    .cfi_endproc
......

but in fact, these two functions are "inlined" within main(), and the instructions shown above will never be executed but compiled and loaded into memory (I verified this by adding a breakpoint to line A and checking the memory): main() contains duplicate logic of the above segment

So here are my questions:

What are these two segments used for and why are they compiled out? _Z12vector32_incRSt6vectorIjSaIjEE, _Z11vector8_incRSt6vectorIhSaIhEE
Does the instruction (Line B) in _Z12vector32_incRSt6vectorIjSaIjEE (i.e. shrq $2, %rcx ) means 'to shift size integer of the vector right by 2 bits'? If it does, what's this used for?
Why is size() computed every turn of the loop in vector8_inc, but in vector32_inc it's computed before executing the loop? (see Line C and Line C', under -O2 optimization level of g++)

Nate Eldredge · Answer 1 · 2022-01-28T08:38:28.893

First, function names are a lot more readable if you demangle them, with a tool such as c++filt:

_Z12vector32_incRSt6vectorIjSaIjEE 
-> vector32_inc(std::vector<unsigned int, std::allocator<unsigned int> >&)
_Z11vector8_incRSt6vectorIhSaIhEE 
-> vector8_inc(std::vector<unsigned char, std::allocator<unsigned char> >&)

Q1

You defined those functions as global (external linkage). The compiler compiles one module at a time. It doesn't know whether this file is your entire program, or if you will link it with other object files later. In the latter case, some function from another module might call vector8_inc or vector32_inc, and so the compiler needs to create an out-of-line version to be available for such a call.

A traditional linker links code at the level of modules, and isn't able to selectively exclude functions from individual object files. It has to include your test.o module in the link, so that means the binary contains all the code from that module, including the out-of-line vector_inc functions. You know they can never be called, but the compiler didn't know that and the linker doesn't check.

Here are three ways you can avoid having them included in your binary:

Declare those functions as static in your source code. This ensures they cannot be called from other modules, so if the compiler can inline all calls from within this module, it need not emit the out-of-line versions.
Compile with -fwhole-program which tells GCC to assume that functions cannot be called from anywhere it can't see.
Link using -flto to enable link time optimization. This provides more data to the linker, so that it can be smart and do things like omit individual functions that are never called (and much more). When I used -flto, again I no longer saw the out-of-line code in the binary.

Q2

Shift right by 2 is division by 4. It looks like the std::vector class, in GCC's implementation, contains a start pointer (loaded into %rax) and a finish pointer (loaded into %rdx). Subtracting these pointers yields the number of bytes in the vector. For a vector of unsigned int, each element is 4 bytes in size, so dividing by 4 yields the number of elements.

This value is only used as a test of whether the vector is empty, by comparing it to zero. In this case the loop is skipped and the function does nothing. (shr will set the zero flag if the result is zero, and je jumps if the zero flag is set.) So the only way the shift matters, instead of just comparing the pointer difference to 0, is if the pointer difference were 1, 2 or 3. That should not happen, since pointers to unsigned int should be 4-byte aligned (unless the library is doing clever things like storing metadata in the low bits). Thus I think the shift is actually unnecessary, and this may be a minor missed optimization.

Q3

This is most likely to handle aliasing.

There is a general rule in C++ that, roughly speaking, writing through a pointer of one type may not modify an object of another type. So in vector32_inc, the compiler knows that writing to elements of the vector, through a pointer to unsigned int, cannot modify the vector's metadata (which apparently is objects of some other type, most likely pointers). Therefore it is safe to hoist the size() computation out of the loop, since it cannot change over the course of the loop's execution. Likewise, the pointer used to iterate over the object can be kept in a register throughout.

However, character types have a special exemption from this rule, partly so that things like memcpy can work. So when you do vec[i]++, writing through a pointer to unsigned char, the compiler has to account for the possibility that the pointer may actually point to some of the vector metadata. (This should not actually happen if vector is being used properly, but the compiler doesn't know that.) If it did, then vec[i]++ might modify that metadata in memory. The next loop iteration would then be supposed to use the newly modified metadata, so it has to be reloaded from memory just in case. Note that the vector's start pointer is also reloaded, and the new value used, instead of keeping it in a register throughout.

What are these "abundant" assembly codes used for and why are they compiled out?

1 Answers1

Q1

Q2

Q3