85

I was trying to solve this exercise from www.spoj.com : FCTRL - Factorial

You don't really have to read it, just do it if you are curious :)

First I implemented it in C++ (here is my solution):

#include <iostream>
using namespace std;

int main() {
    unsigned int num_of_inputs;
    unsigned int fact_num;
    unsigned int num_of_trailing_zeros;

    std::ios_base::sync_with_stdio(false); // turn off synchronization with the C library’s stdio buffers (from https://stackoverflow.com/a/22225421/5218277)

    cin >> num_of_inputs;

    while (num_of_inputs--)
    {
        cin >> fact_num;

        num_of_trailing_zeros = 0;

        for (unsigned int fives = 5; fives <= fact_num; fives *= 5)
            num_of_trailing_zeros += fact_num/fives;

        cout << num_of_trailing_zeros << "\n";
    }

    return 0;
}

I uploaded it as the solution for g++ 5.1

The result was: Time 0.18 Mem 3.3M C++ execution results

But then I saw some comments which claimed that their time execution was less than 0.1. Since I couldn't think about faster algorithm I tried to implement the same code in C:

#include <stdio.h>

int main() {
    unsigned int num_of_inputs;
    unsigned int fact_num;
    unsigned int num_of_trailing_zeros;

    scanf("%d", &num_of_inputs);

    while (num_of_inputs--)
    {
        scanf("%d", &fact_num);

        num_of_trailing_zeros = 0;

        for (unsigned int fives = 5; fives <= fact_num; fives *= 5)
            num_of_trailing_zeros += fact_num/fives;

        printf("%d", num_of_trailing_zeros);
        printf("%s","\n");
    }

    return 0;
}

I uploaded it as the solution for gcc 5.1

This time the result was: Time 0.02 Mem 2.1M C execution results

Now the code is almost the same, I added std::ios_base::sync_with_stdio(false); to the C++ code as was suggested here to turn off the synchronization with the C library’s stdio buffers. I also split the printf("%d\n", num_of_trailing_zeros); to printf("%d", num_of_trailing_zeros); printf("%s","\n"); to compensate for double call of operator<< in cout << num_of_trailing_zeros << "\n";.

But I still saw x9 better performance and lower memory usage in C vs. C++ code.

Why is that?

EDIT

I fixed unsigned long to unsigned int in the C code. It should have been unsigned int and the results which are shown above are related to the new (unsigned int) version.

Community
  • 1
  • 1
Alex Lop.
  • 6,810
  • 1
  • 26
  • 45
  • 32
    [C++ streams are extremely slow by design.](http://stackoverflow.com/questions/4340396/does-the-c-standard-mandate-poor-performance-for-iostreams-or-am-i-just-deali) Because slow and steady wins the race. :P (*runs before I get flamed*) – Mysticial Dec 06 '15 at 20:19
  • @Mysticial: A *better* definition is that the C++ streams are designed to be safer and more adaptable. The `printf` family can't print custom classes. A custom class can be printed by `cout` if the class has overridden the extraction `operator<<`. This adaptability may make the streams slower. – Thomas Matthews Dec 06 '15 at 20:22
  • 7
    The slowness doesn't come from safety or adaptability. It's way overdesigned with all the stream flags. – Karoly Horvath Dec 06 '15 at 20:23
  • But on the other hand according to this: http://stackoverflow.com/a/12044847/5218277 there shouldn't be significant difference between `std::cout` and `printf` ... And again, I can accept that `std::cout` might be twice slower than `printf` maybe a little more in complicated expressions, but here the expressions are very basic! It doesn't explain **x9** performance impact in my opinion. – Alex Lop. Dec 06 '15 at 20:28
  • 2
    @ThomasMatthews: That's a bold statement! `printf` is much more appropriate for formatting textual output than the C++ overloaded `<<` operator. Just try printing an integer in hex with at least 4 digits... A more versatile interface can hardly justify such a difference in performance, especially since the C++ alternative does not have to determine the types dynamically. – chqrlie Dec 06 '15 at 20:28
  • what is the purpose of the magical spell `std::ios_base::sync_with_stdio(false)`? what is the performance without it? – chqrlie Dec 06 '15 at 20:29
  • 3
    @ThomasMatthews This type safety has no runtime cost in C++, as the types are resolved at compile time. This alone would even be *faster* than `printf()`, which resolves format flags at runtime. The problem is that C++ streams originate from the early 1990s, and not all of the design decisions made back then were ideal. – TheOperator Dec 06 '15 at 20:29
  • 1
    @chqrlie: Given a class `Point` with members `x` and `y`, print it out using `printf` without accessing the members. If I define an `operator<<` method for Point, I can use `cout << Point(x,y)< – Thomas Matthews Dec 06 '15 at 20:30
  • @chqrlie Without `std::ios_base::sync_with_stdio(false)` the results: *Time* 0.22 *Mem* 3.4M – Alex Lop. Dec 06 '15 at 20:30
  • 1
    There is a type mismatch in your C version: `fact_num` is defined as `unsigned long`, but you use the type `%d` to parse a value with `scanf`. If `int` and `long` are nor the same size, you invoke undefined behavior and fixing the scanf format would actually penalize the C version. – chqrlie Dec 06 '15 at 20:31
  • 1
    @ThomasMatthews: that's correct, but if you want to parameterize the string conversion, you end up writing the same code as you would in C. overloading the `<<` to do formatted output is simplistic. – chqrlie Dec 06 '15 at 20:35
  • @ThomasMatthews I'm not sure what you're saying. There is no runtime cost for overloading operators or other functions. The compiler figures out what function to call. – Kevin Dec 06 '15 at 20:36
  • 3
    Did you do your own tests on your PC? Do you have a clue what kind of input is fed into your program? The performance of your program is highly tied to the perfomance of input/output. I suspect a bufferization issue. Try to disable bufferization of `stdout` in the C version with `setbuf(stdout, NULL);` – chqrlie Dec 06 '15 at 20:37
  • 1
    @chqrlie Thanks for bringing up the `unsigned long` issue. It was a typo in the post. The real code has `unsigned int` and the results are related to the `unsigned int` version. – Alex Lop. Dec 06 '15 at 20:38
  • "almost identical" is a pretty generous description – M.M Dec 06 '15 at 20:39
  • @chqrlie I didn't do any tests on my PC as I have no clue what the input which is supplied by spoj.com BUT with `setbuf(stdout, NULL);` for the C code the results significantly dropped. *Time* 0.34 *Mem* 2.1M – Alex Lop. Dec 06 '15 at 20:42
  • @ThomasMatthews `printf` and family can't be extended either, apart from implementation-specific ways. `FILE` can only actually deal with file streams. The C++ I/O system allows any type of data source/destination to be used, which is not to be underrated. – edmz Dec 06 '15 at 20:46
  • Can you try reading all the inputs into a vector, and then compute and write all the outputs. This should speed up the C++ version. – chqrlie Dec 06 '15 at 20:46
  • @chqrlie Thanks for the input. I can do it (reading all the inputs into a vector, and then compute and write all the outputs). But I am not looking to speed up the C++ version, I am just trying to understand what causes the C code run 9 times faster than the C++ one... – Alex Lop. Dec 06 '15 at 20:49
  • 1
    I understand that. Disabling buffering on the C version shows the drastic improvement buffering brings. I suspect C++ flushes the output on `cout` when input is requested from `cin`. Some C libraries do this as well, but usually only when reading/writing to and from the terminal. Reading all the inputs at once could help finding out why the C++ version is so much slower. I suspect it is strictly a buffering issue. – chqrlie Dec 06 '15 at 20:55
  • 1
    Comparing apples and oranges. – too honest for this site Dec 06 '15 at 21:01
  • 2
    @Olaf: perfectly valid comparison and the observed difference of performance definitely deserves investigating. – chqrlie Dec 06 '15 at 21:05
  • 8
    @AlexLop. Using a `std::ostringstream` to accumulate the output and sending it to `std::cout` _all at once_ at the end brings the time down to 0.02. Using `std::cout` in a loop is simply slower in their environment and I don't think there's a simple way to improve it. – Blastfurnace Dec 06 '15 at 21:10
  • 1
    @chqrlie: As much as investigating the nutrition of apples and oranges, which one is more healthy, etc. But that is too broad for SO. Just porting code from one language to another using the same constructs often leads to such results. Just because you cannot just use similar **looking** functions. – too honest for this site Dec 06 '15 at 21:19
  • 6
    Is noone else concerned by the fact that these timings were obtained using ideone? – ildjarn Dec 06 '15 at 21:21
  • @The Operator Middle 1980s actually, but certainly a very poor piece of work. – user207421 Dec 06 '15 at 21:29
  • 6
    @Olaf: I'm afraid I disagree, this kind of question is very much on topic for all the chosen tags. C and C++ are sufficiently close in general that such a difference in performance begs for an explanation. I'm glad we found it. Maybe the GNU libc++ should be improved as a consequence. – chqrlie Dec 06 '15 at 21:49
  • 1
    @this: `scanf` has its own bag of flaws and shortcomings, too late to fix them because they have been bolted into the Standard many years ago. The issue seems only related to the buffering scheme, namely the flushing of output before an input request. Such a feature is not irrational. Standard C does not mandate it and the C libraries that implement it seem to only do it for terminal input and output. I don't know what the C++ standard says about that. It may be more explicit, making some programs much less efficient. – chqrlie Dec 06 '15 at 23:02
  • This question just made me spew coffee a little. I think most of us here sympathize with your sense of disbelief. Don't worry, there are enough landmines scattered throughout C++ to keep you shaking your head and chuckling to yourself for the rest of your career. – zxq9 Dec 07 '15 at 02:07
  • Try the same with `printf` in C++ version and post the result. – i486 Dec 07 '15 at 09:54
  • @i486 with `printf` instead of `cout` the result is similar to the C one *Time* 0.02 – Alex Lop. Dec 07 '15 at 09:59
  • @AlexLop. My opinion is that `cin`/`cout` are only for students. But when I say this get negative comments. Use `printf` - you get precise control, and better readability of program, not only speed. – i486 Dec 07 '15 at 10:06
  • @i486 :) This is something that is not 100% proven :) `cout` has its advantages over `printf` and visa versa. I have read many posts and articles where very qualified C++ specialists recommended to use `cout` over `printf`. – Alex Lop. Dec 07 '15 at 10:09
  • @AlexLop. Everybody can make his choice. But I cannot find a feature available in `cout` which cannot be done with `printf`. Maybe some exotic formatting... The code readability and predictability of `printf` is much better for me. And each time I see `cout` samples, they look like "Hello world" programs. Cannot accept this like part of serious work :) – i486 Dec 07 '15 at 11:01
  • Try uncoupling the input/output streams so they work independently. By default the std::cout needs to be flushed before any input is read from std::cin (otherwise a user may not be able to read the instructions for what he is supposed to input). http://en.cppreference.com/w/cpp/io/manip/unitbuf – Martin York Dec 08 '15 at 01:27
  • I heard someone say about C++ "You don't pay for what you don't use". Is this not actually true? – Vlad Dec 09 '15 at 00:00
  • @Vlad But I paid for what I actually used... I just was not aware that each call for `cin` causes `cout` buffer to be flushed, which is appeared to be as an expensive operation. – Alex Lop. Dec 09 '15 at 04:38

3 Answers3

57

Both programs do exactly the same thing. They use the same exact algorithm, and given its low complexity, their performance is mostly bound to efficiency of the input and output handling.

scanning the input with scanf("%d", &fact_num); on one side and cin >> fact_num; on the other does not seem very costly either way. In fact it should be less costly in C++ since the type of conversion is known at compile time and the correct parser can be invoked directly by the C++ compiler. The same holds for the output. You even make a point of writing a separate call for printf("%s","\n");, but the C compiler is good enough to compile this as a call to putchar('\n');.

So looking at the complexity of both the I/O and computation, the C++ version should be faster than the C version.

Completely disabling the buffering of stdout slows the C implementation to something even slower than the C++ version. Another test by AlexLop with an fflush(stdout); after the last printf yields similar performance as the C++ version. It is not as slow as completely disabling buffering because output is written to the system in small chunks instead of one byte at a time.

This seems to point to a specific behavior in your C++ library: I suspect your system's implementation of cin and cout flushes the output to cout when input is requested from cin. Some C libraries do this as well, but usually only when reading/writing to and from the terminal. The benchmarking done by the www.spoj.com site probably redirects input and output to and from files.

AlexLop did another test: reading all the inputs at once in a vector and subsequently computing and writing all output helps understanding why the C++ version is so much slower. It increases performance to that of the C version, this proves my point and removes suspicion on the C++ formatting code.

Another test by Blastfurnace, storing all outputs in an std::ostringstream and flushing that in one blast at the end, does improve the C++ performance to that of the basic C version. QED.

Interlacing input from cin and output to cout seems to cause very inefficient I/O handling, defeating the stream buffering scheme. reducing performance by a factor of 10.

PS: your algorithm is incorrect for fact_num >= UINT_MAX / 5 because fives *= 5 will overflow and wrap around before it becomes > fact_num. You can correct this by making fives an unsigned long or an unsigned long long if one of these types is larger than unsigned int. Also use %u as the scanf format. You are lucky the guys at www.spoj.com are not too strict in their benchmarks.

EDIT: As later explained by vitaux, this behavior is indeed mandated by the C++ standard. cin is tied to cout by default. An input operation from cin for which the input buffer needs refilling will cause cout to flush pending output. In the OP's implementation, cin seems to flush cout systematically, which is a bit overkill and visibly inefficient.

Ilya Popov provided a simple solution for this: cin can be untied from cout by casting another magical spell in addition to std::ios_base::sync_with_stdio(false);:

cin.tie(nullptr);

Also note that such forced flush also occurs when using std::endl instead of '\n' to produce an end of line on cout. Changing the output line to the more C++ idiomatic and innocent looking cout << num_of_trailing_zeros << endl; would degrade performance the same way.

chqrlie
  • 131,814
  • 10
  • 121
  • 189
  • 2
    You're probably right about stream flushing. Collecting the output in a `std::ostringstream` and outputting it all once at the end brings the time down to parity with the C version. – Blastfurnace Dec 06 '15 at 21:18
  • When you write `"QED"` what do you mean? ("which is what had to be proven")? – David C. Rankin Dec 06 '15 at 21:19
  • Looks like you figured it out.... with fflush after the last `printf` the C performantce drops to 0.20 Which is very close to the C++ one. I'll try to collect the inputs into a vector and process them all together. I'll update in a minute. – Alex Lop. Dec 06 '15 at 21:20
  • 2
    @ DavidC.Rankin: I ventured a conjecture (cout is flushed upon reading cin), devised a way to prove it, AlexLop implemented it and it does give convincing evidence, but Blastfurnace came up with a different way to prove my point and his tests give equally convincing evidence. I take it for proof, but of course it is not a completely formal proof, looking at the C++ source code could. – chqrlie Dec 06 '15 at 21:28
  • 2
    I tried using `ostringstream` for the output and it gave *Time* 0.02 QED :). Regarding the `fact_num >= UINT_MAX / 5`, GOOD point! – Alex Lop. Dec 06 '15 at 21:43
  • 1
    Collecting all the inputs into a `vector` and then processing the calculations (without `ostringstream`) gives the same result! *Time* 0.02. Combining both `vector` and `ostringstream` doesn't improve it more. Same *Time* 0.02 – Alex Lop. Dec 06 '15 at 21:52
  • 2
    A simpler fix that works even if `sizeof(int) == sizeof(long long)` is this: add a test in the body of the loop after `num_of_trailing_zeros += fact_num/fives;` to check if `fives` has reached its maximum: `if (fives > UINT_MAX / 5) break;` – chqrlie Dec 06 '15 at 21:53
  • 1
    @chqrlie You can update your answer with my latest results, as you suggested, which support/prove your point. I accept it as the answer to my question. Thank you! – Alex Lop. Dec 06 '15 at 21:56
  • 1
    You can improve performance of interleaved input/output by untying `cin` and `cout`: `std::cin.tie(nullptr);` – Ilya Popov Dec 07 '15 at 01:22
  • "I suspect your system's implementation of cin and cout flushes the output to cout when input is requested from cin." - this is required by the C++ standard, unless you include code to untie it – M.M Dec 07 '15 at 01:43
  • @M.M: It is indeed required by the C++ standard, but only if the input buffer is empty: C++14 27.7.2.1.3 Class basic_istream::sentry: *First, if `is.tie()` is not a null pointer, the function calls `is.tie()->flush()` to synchronize the output sequence with any associated external C stream. Except that this call can be suppressed if the put area of `is.tie()` is empty. Further an implementation is allowed to defer the call to flush until a call of `is.rdbuf()->underflow()` occurs. If no such call occurs before the sentry object is destroyed, the call to flush may be eliminated entirely.* – chqrlie Dec 07 '15 at 06:59
  • `std::cin.tie(nullptr);` spell should be used together with, not instead of the `std::ios_base::sync_with_stdio(false);` spell. – Ilya Popov Dec 07 '15 at 15:18
  • @IlyaPopov: you are the spellmaster `:)` – chqrlie Dec 07 '15 at 15:26
  • 1
    When would you _not_ want to flush `cout` when prompting for input? Well, I suppose if input was redirected from a file, but can that actually be identified inside the program or is that strictly a shell implementation? – JAB Dec 07 '15 at 18:19
  • @JAB: if input is redirected from a file, it will be buffered typically one system pagesize at a time. There is no need to flush `cout` when input is already available in `cin`'s buffer. This behavior is common for the C library and is allowed for the C++ library, even when `cout` is tied to `cin`, as you can read from the C++ Standard quote above. This buffering scheme reduces the number of system calls by a factor of 100 to 500 in this particular case. – chqrlie Dec 07 '15 at 20:32
  • @chqrlie But how do you identify whether `cin` is actually being supplied as user input or comes from a file? To my knowledge, `istream` does not provide the ability to query for the buffer state. – JAB Dec 07 '15 at 22:01
  • @JAB: The program does not need to know. `cin` may not document it, but its implementation can determine whether the system file descriptor is that of a file or a pipe or a device, tty, pseudo-tty, etc. The runtime startup code does this type of verification. But even without such knowledge, reading a block from a file will return a full block if not at end of file, whereas a terminal will typically return a single line at a time. – chqrlie Dec 07 '15 at 22:49
  • @chqrlie So you can just read a block from `cin` and use that to determine whether or not the stream should be untied. (Which I just realized was essentially what you stated in your first reply to me, just on a per-call basis; somehow I read it wrong the first time. Sorry about that.) – JAB Dec 07 '15 at 22:54
  • @NAB: not really: if you read a block from `cin`, it will return what you ask, makeing as many system calls are needed to obtain the number of bytes requested. There might be other ways to tell whether `cin` is reading from a file or a terminal, but that was not the issue we were investigating. Untying `cin` from `cout` prevents the default behavior which proves inefficient in the OPs benchmarks. In this case, it would not have adverse effects, even from a terminal: the user would just be prompted for the next input before he sees the output from the last computation. – chqrlie Dec 07 '15 at 23:05
44

Another trick to make iostreams faster when you use both cin and cout is to call

cin.tie(nullptr);

By default, when you input anything from cin, it flushes cout. It can significantly harm performance if you do interleaved input and output. This is done for the command line interface uses, where you show some prompt and then wait for data:

std::string name;
cout << "Enter your name:";
cin >> name;

In this case you want to make sure the prompt is actually shown before you start waiting for input. With the line above you break that tie, cin and cout become independent.

Since C++11, one more way to achieve better performance with iostreams is to use std::getline together with std::stoi, like this:

std::string line;
for (int i = 0; i < n && std::getline(std::cin, line); ++i)
{
    int x = std::stoi(line);
}

This way can come close to C-style in performance, or even surpass scanf. Using getchar and especially getchar_unlocked together with handwritten parsing still provides better performance.

PS. I have written a post comparing several ways to input numbers in C++, useful for online judges, but it is only in Russian, sorry. The code samples and the final table, however, should be understandable.

Nikko
  • 4,182
  • 1
  • 26
  • 44
Ilya Popov
  • 3,765
  • 1
  • 17
  • 30
  • 1
    Thank you for the explanation and +1 for the solution, but your proposed alternative with `std::readline` and `std::stoi` is not functionally equivalent to the OPs code. Both `cin >> x;` and `scanf("%f", &x);` accept ant whitespace as separator, there can be multiple numbers on the same line. – chqrlie Dec 07 '15 at 07:14
27

The problem is that, quoting cppreference:

any input from std::cin, output to std::cerr, or program termination forces a call to std::cout.flush()

This is easy to test: if you replace

cin >> fact_num;

with

scanf("%d", &fact_num);

and same for cin >> num_of_inputs but keep cout you'll get pretty much the same performance in your C++ version (or, rather IOStream version) as in C one:

enter image description here

The same happens if you keep cin but replace

cout << num_of_trailing_zeros << "\n";

with

printf("%d", num_of_trailing_zeros);
printf("%s","\n");

A simple solution is to untie cout and cin as mentioned by Ilya Popov:

cin.tie(nullptr);

Standard library implementations are allowed to omit the call to flush in certain cases, but not always. Here's a quote from C++14 27.7.2.1.3 (thanks to chqrlie):

Class basic_istream::sentry : First, if is.tie() is not a null pointer, the function calls is.tie()->flush() to synchronize the output sequence with any associated external C stream. Except that this call can be suppressed if the put area of is.tie() is empty. Further an implementation is allowed to defer the call to flush until a call of is.rdbuf()->underflow() occurs. If no such call occurs before the sentry object is destroyed, the call to flush may be eliminated entirely.

vitaut
  • 49,672
  • 25
  • 199
  • 336
  • Thanks for the explanation. Yet quoting C++14 27.7.2.1.3: *Class basic_istream::sentry* : *First, if `is.tie()` is not a null pointer, the function calls `is.tie()->flush()` to synchronize the output sequence with any associated external C stream. Except that this call can be suppressed if the put area of `is.tie()` is empty. Further an implementation is allowed to defer the call to flush until a call of `is.rdbuf()->underflow()` occurs. If no such call occurs before the sentry object is destroyed, the call to flush may be eliminated entirely.* – chqrlie Dec 07 '15 at 06:40
  • As usual with C++, things are more complex than they look. The OP's C++ library is not as efficient as the Standard allows to be. – chqrlie Dec 07 '15 at 06:40
  • Thanks for the cppreference link. I don't like the "wrong answer" in the print screen though ☺ – Alex Lop. Dec 07 '15 at 06:46
  • @AlexLop. Oops, fixed the "wrong answer" issue =). Forgot to update the other cin (this doesn't affect the timing though). – vitaut Dec 07 '15 at 14:38
  • @chqrlie Right, but even in the underflow case performance is likely to be worse compared to the stdio solution. Thanks for the standard ref. – vitaut Dec 07 '15 at 14:41