Printing a character literal is more complicated than printing a string literal

Question

The following code

#include <iostream>

void foo() {
    std::cout << ' ';
}

void bar() {
    std::cout << " ";
}

compiles to the following assembly with g++ 10.2 at -O3:

foo():
        sub     rsp, 24
        mov     edx, 1
        mov     edi, OFFSET FLAT:_ZSt4cout
        lea     rsi, [rsp+15]
        mov     BYTE PTR [rsp+15], 32
        call    std::basic_ostream<char, std::char_traits<char> >& std::__ostream_insert<char, std::char_traits<char> >(std::basic_ostream<char, std::char_traits<char> >&, char const*, long)
        add     rsp, 24
        ret
.LC0:
        .string " "
bar():
        mov     edx, 1
        mov     esi, OFFSET FLAT:.LC0
        mov     edi, OFFSET FLAT:_ZSt4cout
        jmp     std::basic_ostream<char, std::char_traits<char> >& std::__ostream_insert<char, std::char_traits<char> >(std::basic_ostream<char, std::char_traits<char> >&, char const*, long)
_GLOBAL__sub_I_foo():
        sub     rsp, 8
        mov     edi, OFFSET FLAT:_ZStL8__ioinit
        call    std::ios_base::Init::Init() [complete object constructor]
        mov     edx, OFFSET FLAT:__dso_handle
        mov     esi, OFFSET FLAT:_ZStL8__ioinit
        mov     edi, OFFSET FLAT:_ZNSt8ios_base4InitD1Ev
        add     rsp, 8
        jmp     __cxa_atexit

Here we can see that in both cases the std::__ostream_insert function is called, but in two different ways: with call and with jmp. In the first case, the space symbol is written to the stack (mov BYTE PTR [rsp+15], 32) and the function is then called with that address. Because the symbol is written to the stack, space for it must be reserved beforehand and released afterwards. That is why the call instruction is used in the first case instead of the lighter jmp: the stack has to be restored after the call (add rsp, 24). So, contrary to expectations, printing a symbol takes more time than printing a string literal.

Why does this happen? Why is the symbol not stored in memory? Why hasn't the optimizer chosen a char-specific function to call?

I'm not convinced any of these programs "takes more time" than the other... For example, can there be a cache miss when accessing `.LC0`? — dyp, Feb 11 '21 at 15:43
Of course, a cache miss may or may not occur: a string consisting of a single space is used quite often when printing. But I think that in either case printing a character should be faster than printing a string. A possible cache miss is not a reason to choose this way of printing: the instructions used in bar are a subset of those used in foo. That would have been solved by using `std::ostream::put` with the char argument in a register. I think a cache miss is the reason the symbol is not stored in the data section. — Daniel Z., Feb 11 '21 at 16:15
I don't think the optimizer is smart enough to understand functions to a degree that it can just replace one function call with another. So the stdlib implementation probably says "call __ostream_insert for a string" when you insert a single character. — dyp, Feb 11 '21 at 16:17
If the optimizer is disabled, two different functions are chosen by overload resolution. But with the optimizer enabled they both end up in a function that prints a string (it takes a length parameter). So for a single character that is far from optimal. — Daniel Z., Feb 11 '21 at 16:19
The optimizer is heavily inlining. I would expect the same function to be called both in optimized and unoptimized builds after lots of layers of functions (except if someone is doing `#ifdef NDEBUG` shenanigans) — dyp, Feb 11 '21 at 16:37
Note that libc++ similarly uses the same function, but clang optimizes differently and only "uses up" 8 bytes of stack: https://compiler-explorer.com/z/5zKo54 — dyp, Feb 11 '21 at 16:49

Useless · Answer 1 · 2021-02-11T22:46:09.163

2

Printing a symbol takes more time than printing a string literal

You mean printing a char takes more instructions than printing a string literal.

If you want to claim it takes more time, you have to actually time it.

And just because the first one is the subset of the second one, the only reason I see why it can be slower is the cache-misses. So it seems that it, really, would be slower, if there were no miss.

Speculating about how things might perform is a waste of time that could be spent actually profiling it. If you don't care enough to profile it, then it's not important enough to worry about in the first place.

A cache miss is certainly much more likely with a seldom-used string literal than with a local stack variable, so "if there were no miss" is a pretty big assumption.
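As a rough illustration of what "actually time it" could look like, here is a minimal sketch. It assumes a `std::ostringstream` target rather than `std::cout` so that terminal I/O does not dominate, and the names `time_ns`, `Sample`, `time_char_inserts` and `time_string_inserts` are hypothetical helpers, not part of any library. A serious measurement would need an optimized build, repeated runs and ideally a proper benchmarking tool.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <sstream>
#include <string>

// Hypothetical helper: runs `body` `iterations` times and returns the
// elapsed time in nanoseconds, measured with a monotonic clock.
template <typename Body>
long long time_ns(std::size_t iterations, Body body) {
    assert(iterations > 0);
    const auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < iterations; ++i) {
        body();
    }
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count();
}

// Result of one measurement: how many characters ended up in the stream
// and how long the insertions took.
struct Sample {
    std::size_t chars_written;
    long long elapsed_ns;
};

// Inserts the char literal ' ' repeatedly (the foo() case from the question).
Sample time_char_inserts(std::size_t iterations) {
    std::ostringstream out;
    const long long ns = time_ns(iterations, [&out] { out << ' '; });
    return {out.str().size(), ns};
}

// Inserts the string literal " " repeatedly (the bar() case from the question).
Sample time_string_inserts(std::size_t iterations) {
    std::ostringstream out;
    const long long ns = time_ns(iterations, [&out] { out << " "; });
    return {out.str().size(), ns};
}
```

Calling both functions with the same iteration count and comparing `elapsed_ns` over several runs gives a first impression; the numbers will vary with compiler, flags and machine, so treat them as indicative only.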

Why does this happen? Why is the symbol not stored in memory?

You can always check the code. In ostream line 507 for version 10.2, we see

  //@{
  /**
   *  @brief  Character inserters
   *  @param  __out  An output stream.
   *  @param  __c  A character.
   *  @return  out
   *
   *  Behaves like one of the formatted arithmetic inserters described in
   *  std::basic_ostream.  After constructing a sentry object with good
   *  status, this function inserts a single character and any required
   *  padding (as determined by [22.2.2.2.2]).  @c __out.width(0) is then
   *  called.
   *
   *  If @p __c is of type @c char and the character type of the stream is not
   *  @c char, the character is widened before insertion.
  */
  template<typename _CharT, typename _Traits>
    inline basic_ostream<_CharT, _Traits>&
    operator<<(basic_ostream<_CharT, _Traits>& __out, _CharT __c)
    { return __ostream_insert(__out, &__c, 1); }

So, the reason the symbol isn't "stored in memory" (except as the literal 32 in the MOV instruction) is that the library explicitly takes its address, which forces temporary materialization.
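The shape of that code can be sketched outside the library. In the sketch below, `insert_n` is a hypothetical stand-in for `__ostream_insert` that only writes raw bytes (the real function also handles the sentry and padding), and `insert_char` and `insert_cstr` are made-up names for the two inserters. The point is only that a pointer-plus-length interface makes the by-value `char` addressable, so it needs a memory slot.

```cpp
#include <cassert>
#include <cstring>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical stand-in for __ostream_insert: pointer plus length.
// Unlike the real function, this does no sentry or padding work.
std::ostream& insert_n(std::ostream& out, const char* s, std::streamsize n) {
    return out.write(s, n);
}

// Char inserter: the parameter arrives by value, but taking its address
// means it must live in memory for the duration of the call.
std::ostream& insert_char(std::ostream& out, char c) {
    return insert_n(out, &c, 1);
}

// String inserter: the pointer already refers to storage that outlives
// the call, so nothing local has to be kept alive.
std::ostream& insert_cstr(std::ostream& out, const char* s) {
    assert(s != nullptr);
    return insert_n(out, s, static_cast<std::streamsize>(std::strlen(s)));
}
```

Both wrappers produce the same bytes for `' '` and `" "`; they differ only in where the bytes passed to `insert_n` are stored.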

Why hasn't the optimizer chosen a char-specific function to call?

The optimizer doesn't choose which function to call in the first place. It isn't involved until after the overload is selected and temporary materialization has already happened. Expecting it to identify a completely different code path that might have generated fewer instructions is asking a lot - especially when there's no solid reason to prefer that version apart from your aesthetic preferences.

The library writers chose to do this because they can reuse the same __ostream_insert code, which you can read here. As a library implementer this makes sense because, as you can see, it's not trivial. Single chars are still formatted output, with the same sentry, stream state and padding logic as strings.

If you wanted an unformatted char output, you should be using ostream::put(char) instead anyway, whose implementation is rather simpler.
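The difference between formatted and unformatted output shows up as soon as a field width is set. The following is a small sketch; `with_inserter` and `with_put` are hypothetical helper names, and a `std::ostringstream` is used only so the result can be inspected as a string.

```cpp
#include <cassert>
#include <iomanip>
#include <sstream>
#include <string>

// Formatted output: operator<< honours the field width and pads.
std::string with_inserter(char c, int width) {
    assert(width >= 0);
    std::ostringstream out;
    out << std::right << std::setw(width) << c;
    return out.str();
}

// Unformatted output: put() writes the character and ignores the width.
std::string with_put(char c, int width) {
    assert(width >= 0);
    std::ostringstream out;
    out << std::right << std::setw(width);
    out.put(c);
    return out.str();
}
```

With a width of 5, `with_inserter('.', 5)` yields four spaces followed by the dot, while `with_put('.', 5)` yields just the dot, which is why the two calls cannot simply be swapped for one another.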

edited Feb 11 '21 at 22:46

answered Feb 11 '21 at 16:31

Useless


It allocates 24 bytes of stack just to write a single character whose value is encoded in the mov instruction. The string call contains fewer instructions because the string is stored directly in memory, so there is no need for stack allocation. The character could have been stored there too, but it seems that the optimizer is afraid of cache misses. But that is not a reason not to use std::ostream::put with the character in a register. – Daniel Z. Feb 11 '21 at 16:34
I answered the question "why does this code produce this output", which I believe is what you asked. _Now_ you're asking instead _why doesn't the optimizer perform this other transformation_, which is a much harder question to answer. – Useless Feb 11 '21 at 16:36
Yes, I mean it takes more instructions. And just because the first one is the subset of the second one, the only reason I see why it can be slower is the cache-misses. So it seems that it, really, would be slower, if there were no miss. There are three questions at the end of the first post. – Daniel Z. Feb 11 '21 at 16:38
"the reason the symbol isn't "stored in memory"" That is the only reason it has many more instructions – Daniel Z. Feb 11 '21 at 16:40
Note that there are requirements on stack alignment for each function call. You can't just allocate a single byte on the stack. As to why they end up with 24 (shrug), IIRC the alignment requirement is 16 on x86_64 Linux. – dyp Feb 11 '21 at 16:40
@DanielZ. `It allocates 24 bytes of stack`: the term "allocating" is a bit misleading. It moves the stack pointer but does not allocate anything (in the sense of requesting memory from the system). Replacing `operator<<` with `std::ostream::put` would only work if the compiler knew that these functions are identical. This might be possible for the std library, but the fact that it is not done indicates that there is no need for this kind of specialization. – t.niese Feb 11 '21 at 16:44
@t.niese If the optimizer is disabled, it chooses two template instances: one for char and the other for const char*. But the optimizer converts both of these calls into a call to a function that is intended for printing strings, and for a single character that seems to have unnecessary overhead. – Daniel Z. Feb 11 '21 at 16:44
@DanielZ. yes because for `char` it is `operator<<(basic_ostream<_CharT, _Traits>& __out, _CharT __c) { return __ostream_insert(__out, &__c, 1); }` so the compiler inlines the `return __ostream_insert(__out, &__c, 1);`. Both `char` and `char *` call `__ostream_insert`. – t.niese Feb 11 '21 at 16:45
@t.niese So we come to the following question: why does it use __ostream_insert instead of put? – Daniel Z. Feb 11 '21 at 16:46
@Useless I understand how and why materialization takes place. But the reason why it takes place dissolves if we put this symbol in a register for a call to std::ostream::put. – Daniel Z. Feb 11 '21 at 16:47
@DanielZ. because that's how this particular std library is implemented. That's not the problem of the compiler that the std library uses `__ostream_insert(__out, &__c, 1);` instead of `put`. The compiler could create a specialized assumption for the std that `put` and `__ostream_insert(__out, &__c, 1);` are equal, but either that's not wanted or does not improve anything. – t.niese Feb 11 '21 at 16:47
@t.niese I wanted to understand the logic behind this: we have a specialized function for printing a character, but the gcc std developers have chosen to use the string printing function. Seems kinda weird? It suggests there is some hidden logic behind this. – Daniel Z. Feb 11 '21 at 16:49
@DanielZ. use `std::cout << std::right << std::setw(13);` before you write anything to `std::cout` and you'll see that `std::cout.put('.');` and `std::cout << '.';` are different. – t.niese Feb 11 '21 at 17:08
@t.niese Wow, that's really something I was not thinking about. Now it seems clear! Thank you! – Daniel Z. Feb 11 '21 at 17:13
