
Consider this small C++ code snippet:

#include <iostream>
#include <string>

int main() {
    std::cout << std::string("This.String.Ends!") << std::endl;
}

A portion of the assembly generated by this snippet (compiled with clang++ -O3):

...

call    std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_create(unsigned long&, unsigned long)
mov     qword ptr [rsp + 16], rax
mov     rcx, qword ptr [rsp + 8]
mov     qword ptr [rsp + 32], rcx
movups  xmm0, xmmword ptr [rip + .L.str]
movups  xmmword ptr [rax], xmm0
mov     byte ptr [rax + 16], 33           # <-- stores '!' (ASCII 33) as an immediate
mov     qword ptr [rsp + 24], rcx
mov     rax, qword ptr [rsp + 16]
mov     byte ptr [rax + rcx], 0

...

.L.str:
        .asciz  "This.String.Ends!"

Even though the string literal in the binary already ends with '!', the generated code stores that character with a separate instruction (33 is the ASCII code for '!') instead of copying it from the literal. Questions:

  1. What is this optimization called? Is there a formal name for it?
  2. I am able to reproduce this behaviour with strings of size 8x+y. I can imagine that fetching only one character from memory is more expensive than using an additional instruction. Is that the case here? If so, why not inline the whole string (it's quite short in this case)?
  3. How can I keep -O3 but avoid this particular optimization? By trial and error, I found that this combination of flags (with g++) disables it: -fno-tree-ccp -fno-tree-dominator-opts -fno-tree-forwprop -fno-tree-fre -fno-code-hoisting -fno-tree-pre -fno-tree-vrp, but I am guessing each one also disables other things that I probably don't want to miss out on.
  4. If the compiler does generate an instruction for the last character, why does it still leave the complete string in the .rodata section of the binary, and not just the first 16 bytes? Doesn't that waste space?

My use case: This optimization creates a problem for patching string literals in binaries after compiling and linking (replacing 'x' with 'y' padded with \0*(len('x')-len('y')), assuming len('x') >= len('y')). The patched bytes in .rodata have no effect on a character that is stored as an immediate in the code. I know there may be other optimizations of this kind that would also prevent this, but I wanted to give some context on how I hit this issue.

Nehal J Wani
  • [Small String Optimization](https://stackoverflow.com/questions/21694302/what-are-the-mechanics-of-short-string-optimization-in-libc). Does that answer your question? – PaulMcKenzie Jul 26 '20 at 02:35
  • @PaulMcKenzie I did read about that; however, I am not convinced that SSO is what's in play here, because of the different sizes. As you can see here, I always see this behaviour at lengths of the form 8x+y, it is consistent across clang and gcc, and it does not depend on the sizes 15/32/22/24 – Nehal J Wani Jul 26 '20 at 02:40
  • Here is another example: https://godbolt.org/z/Eo9T6j – Nehal J Wani Jul 26 '20 at 02:46
  • Probably to decrease executable size, because of the alignment of object in static storage and because of the size of registers. – Oliv Jul 26 '20 at 08:58
  • This is an instance of the "load-known-value" optimization. Since the compiler knows that the value is always 33 there is no point in loading it from memory. – Johan Aug 31 '20 at 10:25
  • @Johan Could you point me to some online resource/code where I can read more about it? – Nehal J Wani Aug 31 '20 at 15:10

0 Answers