31

In this answer we can read that:

I suppose there's little difference between using '\n' or using "\n", but the latter is an array of (two) characters, which has to be printed character by character, for which a loop has to be set up, which is more complex than outputting a single character.

emphasis mine

That makes sense to me. I would think that outputting a const char* requires a loop which will test for null-terminator, which must introduce more operations than, let's say, a simple putchar (not implying that std::cout with char delegates to calling that - it's just a simplification to introduce an example).

That convinced me to use

std::cout << '\n';
std::cout << ' ';

rather than

std::cout << "\n";
std::cout << " ";

It's worth mentioning here that I am aware of the performance difference being pretty much negligible. Nonetheless, some may argue that the former approach conveys the intent of actually passing a single character, rather than a string literal that just happens to be one char long (two chars long if you count the '\0').

Lately I did a few small code reviews for someone who was using the latter approach. I made a small comment on the case and moved on. The developer then thanked me and said that he hadn't even thought of such a difference (mainly focusing on the intent). It was not impactful at all (unsurprisingly), but the change was adopted.

I then began wondering how significant that change actually is, so I ran to godbolt. To my surprise, it showed the following results when tested on GCC (trunk) with -std=c++17 -O3 flags. The generated assembly for the following code:

#include <iostream>

void str() {
    std::cout << "\n";
}

void chr() {
    std::cout << '\n';
}

int main() {
    str();
    chr();
}

surprised me, because it appears that chr() is actually generating exactly twice as many instructions as str() does:

.LC0:
        .string "\n"
str():
        mov     edx, 1
        mov     esi, OFFSET FLAT:.LC0
        mov     edi, OFFSET FLAT:_ZSt4cout
        jmp     std::basic_ostream<char, std::char_traits<char> >& std::__ostream_insert<char, std::char_traits<char> >(std::basic_ostream<char, std::char_traits<char> >&, char const*, long)
chr():
        sub     rsp, 24
        mov     edx, 1
        mov     edi, OFFSET FLAT:_ZSt4cout
        lea     rsi, [rsp+15]
        mov     BYTE PTR [rsp+15], 10
        call    std::basic_ostream<char, std::char_traits<char> >& std::__ostream_insert<char, std::char_traits<char> >(std::basic_ostream<char, std::char_traits<char> >&, char const*, long)
        add     rsp, 24
        ret

Why is that? Why do both of them eventually call the same std::basic_ostream function with a const char* argument? Does it mean that the char literal approach is not only not better, but actually worse than the string literal one?

Fureeish
  • Interesting that both versions end up calling the `char*` version of `ostream::insert`. (Is there a single-char overload?) What optimization level was used when generating the assembly? – 3Dave Jul 28 '19 at 19:58
  • 2
    @3Dave it seems that there is no `char` overload. GCC and Clang delegate to the `const char*` overload, but MSVC (thanks @PaulSanders) provides an additional [optimisation](https://godbolt.org/z/AQiyMw). As for the optimisation level, I specified that in the question - I used `GCC 8.2.0` with `-O3`. – Fureeish Jul 28 '19 at 20:02
  • Given that you're doing I/O, the performance difference is not just negligible but down in the noise. – user207421 Jul 29 '19 at 00:25
  • 1
    @Bohemian I think OP is counting the null character terminating the array, as alluded to later in the question. – stewbasic Jul 29 '19 at 00:57
  • 1
    @Bohemian: The static storage for the string literal `"\n"` consists of 2 bytes: `0xa` (the newline) and `0` (the terminator). A 2-byte array is a good description of it. (I'm assuming a "normal" ASCII/UTF-8 C++ implementation like g++ for x86-64 where char = byte.) A pointer to this implicit-length string/array is passed to the ostream operator. – Peter Cordes Jul 29 '19 at 03:36
  • @user207421: The overall code-size difference can add up over many use-cases in a large codebase. Smaller executables are smaller to download, and smaller to load from disk, and have better TLB locality. Pretty hard to measure a difference, but in widely-used software every little bit helps. – Peter Cordes Jul 29 '19 at 03:39

3 Answers

33

None of the other answers really explain why the compiler generates the code it does in your Godbolt link, so I thought I'd chip in.

If you look at the generated code, you can see that:

std::cout << '\n';

Compiles down to, in effect:

const char c = '\n';
std::__ostream_insert(std::cout, &c, 1);  // the internal libstdc++ helper named in the assembly

and to make this work, the compiler has to generate a stack frame for function chr(), which is where many of the extra instructions come from.

On the other hand, when compiling this:

std::cout << "\n";

the compiler can optimise str() to simply 'tail call' the underlying insert function (after inlining operator<< (const char *) and folding the length to 1), which means that no stack frame is needed.

So your results are somewhat skewed by the fact that you put the calls to operator<< in separate functions. It's more revealing to make these calls inline, see: https://godbolt.org/z/OO-8dS

Now you can see that, while outputting '\n' is still a little more expensive (because libstdc++ has no dedicated single-character path and forwards operator<< (ostream&, char) to the same pointer-plus-length function), the difference is less marked than in your example.

Paul Sanders
  • Good answer. It really amazes me that, by default, outputting `char`s really delegates to outputting `const char*`. C++ seems to be performance-focused and such things, while usually negligible, still slip through... – Fureeish Jul 28 '19 at 17:57
  • 5
    @Fureeish Yes, I was surprised too. I checked briefly in Godbolt and Clang does the same thing as gcc. MSVC, on the other hand, appears to have a specific overload `operator<< (char)`, see: https://godbolt.org/z/AQiyMw – Paul Sanders Jul 28 '19 at 19:44
  • @PaulSanders: same; I'd assumed normal C++ library would include an ostream equivalent of `fputc` that took a char by value. But apparently only MSVC does, out of the 3 major ones for x86 (MSVCRT, libstdc++, and libc++). I checked libc++ on Godbolt (`clang -stdlib=libc++` https://godbolt.org/z/sDDgsC) and it always uses a `char*` + length function for strings as well as characters. (For unknown string lengths, it runs `strlen` first). So internally I guess its iostream library just has to work with explicit-length buffers so it can memcpy instead of strcpy. – Peter Cordes Jul 29 '19 at 03:52
  • Compilers have a missed optimization that they don't just `push 0xa` / `mov rsi,rsp` to store + reserve space for the character; instead they `sub rsp, ??` and separately do a byte store, then need an LEA to copy the address. Silly compilers. Inside a larger function that makes sense, though; they usually want RSP aligned by 16 so a push would misalign it. This becomes a special case of the general missed optimization of not using `push` to store initial values for variables that are being spilled / initialized in memory right away on function entry. – Peter Cordes Jul 29 '19 at 03:56
  • 3
    People often forget that `<<` is *formatted* output - it is required to pad your character according to the stream's width/fill/flags. That's a decent chunk of code that can be reused between `char` and `const char*`, so I'm not really surprised that they share a common implementation. If you just want to output a single character, there's the unformatted `put`. – T.C. Jul 30 '19 at 04:03
7

Keep in mind though that what you see in the assembly is only the creation of the callstack, not the execution of the actual function.

std::cout << '\n'; is still slightly faster than std::cout << "\n";

I've created this little program to measure the performance and it's slightly faster on my machine with g++ -O3. Try it yourself!

Edit: Sorry, I noticed a typo in my program, and it's not that much faster! I can barely measure any difference anymore. Sometimes one is faster, other times the other.

#include <chrono>
#include <iostream>

class timer {
    private:
        decltype(std::chrono::high_resolution_clock::now()) begin, end;

    public:
        void
        start() {
            begin = std::chrono::high_resolution_clock::now();
        }

        void
        stop() {
            end = std::chrono::high_resolution_clock::now();
        }

        template<typename T>
        auto
        duration() const {
            return std::chrono::duration_cast<T>(end - begin).count();
        }

        auto
        nanoseconds() const {
            return duration<std::chrono::nanoseconds>();
        }

        void
        printNS() const {
            std::cout << "Nanoseconds: " << nanoseconds() << std::endl;
        }
};

int
main(int argc, char** argv) {
    timer t1;
    t1.start();
    for (int i{0}; 10000 > i; ++i) {
        std::cout << '\n';
    }
    t1.stop();

    timer t2;
    t2.start();
    for (int i{0}; 10000 > i; ++i) {
        std::cout << "\n";
    }
    t2.stop();
    t1.printNS();
    t2.printNS();
}

Edit: As geza suggested, I tried 100000000 iterations for both, sent the output to /dev/null, and ran it four times. Each pair below shows '\n' first and "\n" second; '\n' was faster once and slower three times, but never by much, and it might be different on other machines:

Nanoseconds: 8668263707
Nanoseconds: 7236055911

Nanoseconds: 10704225268
Nanoseconds: 10735594417

Nanoseconds: 10670389416
Nanoseconds: 10658991348

Nanoseconds: 7199981327
Nanoseconds: 6753044774

I guess overall I wouldn't care too much.

Michael Mahn
  • "*Keep in mind though that what you see in the assembly is only the creation of the callstack*" - while some operations are faster than others in assembly, both cases delegate to the same call, but the `char` version first requires more instructions. This, ultimately, will make it slower. What's your compiler? On my machine (MinGW, GCC 8.2.0, `-O3`) the results are: `1563570000` for the `char` approach and `821538000` for the string literal approach, which makes the `char` version around **2 times slower**. The 2x ratio is consistent across more tests. – Fureeish Jul 28 '19 at 13:07
  • On MSVC 19.16, the `char` version is around 10(!) times slower on my machine. – zett42 Jul 28 '19 at 13:11
  • 1
    True, they delegate to the same call, but `std::cout << "Hello World";` would also delegate to the same call. But the size changes: for '\n' is 1, for "\n" is 2, for "Hello World" is 12. The execution of the function can still be faster there. Sorry, I also had a typo in my program at first where I had ' ' instead of '\n'. Now I've increased to 1000000 iterations and '\n' is still consistently faster, although only slightly. – Michael Mahn Jul 28 '19 at 13:13
  • 4
    You should use a much larger iteration count, and redirect stdout to some file or `/dev/null`. Then the difference should be much smaller (as we don't want to benchmark noise, CPU's dynamic frequency scaling, console output speed and such). – geza Jul 28 '19 at 13:16
  • @zett42 Hm, OK, interesting. On my machine (Linux Ubuntu with g++ 7.4) '\n' is faster than "\n", but not by much. Guess I'll edit my answer then. – Michael Mahn Jul 28 '19 at 13:17
  • 3
    I ran the 100000000 iterations with an MSVC build and redirection to NUL. Now the tables have turned. Over 3 runs, `char` output was always faster, but not by much; on average the `char` output was 1.7% faster. – zett42 Jul 28 '19 at 13:41
  • @zett42 Yeah, I also didn't notice much of a difference. Definitely an interesting theoretical question, but I think for the real world, it's mostly irrelevant. – Michael Mahn Jul 28 '19 at 13:49
  • 3
    Also I would argue that `"\n"` is easier to maintain, when one needs to insert characters before the line break. MSVC doesn't complain if I write `'xyz\n'`, which could easily be overlooked in a hurry. – zett42 Jul 28 '19 at 13:53
  • 2
    Your performance results are almost certainly dominated by flushing because you haven't turned off `sync_with_stdio`. – Ben Voigt Jul 28 '19 at 21:31
  • @BenVoigt Sure, but I mean unless you print so much to the console that it becomes performance critical, you usually also don't turn off sync, do you? I've tried it with sync turned off now and there is barely any difference between a char and a char*. Sometimes one is slightly faster, other times the other. But turning off sync by itself gives about a 30% performance boost. – Michael Mahn Jul 29 '19 at 06:05
  • 1
    @MichaelMahn: Do you see any speed difference between `'A'` and `'\n'`? When sync is turned on, `\n` (whether standalone character or in a string) should be causing flushing. – Ben Voigt Jul 29 '19 at 13:09
  • @BenVoigt Depends, if I redirect to /dev/null, there is no measurable difference between 'A' and '\n'. Maybe on a realtime system. If I print to the console '\n' is about 14 times slower when syncing with stdio, and 6 times slower without syncing. – Michael Mahn Jul 30 '19 at 19:36
5

Yes, for this particular implementation, for your example, char version is a little bit slower than the string version.

Both versions call a write(buffer, bufferSize) style function. For the string version, bufferSize is known at compile time (1 byte), so there is no need to find the zero terminator at run time. For the char version, the compiler creates a little 1-byte buffer on the stack, puts the character into it, and passes this buffer to the write function. So, the char version is a little bit slower.

geza
  • 1
    Same for Clang. MSVC compiles both to the same assembly. This really seems weird, especially since the `char` version seems to be universally advised. – Fureeish Jul 28 '19 at 12:42
  • 1
    @Fureeish: it should not matter too much. The difference is very much negligible. Writing the output as a whole should take much more time (even if it is buffered) than creating the little 1-byte buffer. – geza Jul 28 '19 at 12:45
  • I know that it should not matter too much and that's probably why there are no optimisations introduced to the `char` approach. If no one comes up with an unexpected explanation, I will happily accept your answer in the near future. – Fureeish Jul 28 '19 at 12:48
  • Couldn't the char have been stored in const memory, like the string was, and then just pass its address? Indeed, it might even detect that `'\n'` is the first character of `"\n"` and use `.LC0` for both? – Barmar Jul 30 '19 at 19:34
  • @Barmar: if the compiler can prove that it wouldn't cause any problems, then yes, this transformation can be done. But proving this is not trivial. For example, `operator<<` could call (in theory) `chr()`. Which should create another buffer, and not use the previous one (because, in theory, `operator<<` could store the address of the buffer, and expect that it will change, if `chr()` calls it again). – geza Jul 30 '19 at 21:43
  • @geza It's a constant, it can't change. And it uses a `const char *` overload of `operator<<`, so it won't change it. – Barmar Jul 30 '19 at 21:50
  • @Barmar: Sure. I meant the address of the buffer. In theory, `operator<<` could expect, that the incoming buffer's address changes between recursive calls. – geza Jul 30 '19 at 22:51