Why is this C++ deque code 'less efficient' when I demodularize it?

Question

This is a problem I faced when studying solutions to this problem on HackerRank. It basically boils down to the following: given an array A and an integer K, the problem asks you to find the maximum of each contiguous subarray of A with size K. There is some additional stuff (for each test case, this problem occurs T times, and one must read from input to obtain A), but that's the gist of it.

The site checks whether your answer is sufficiently efficient, and even if your code produces the correct output in the end, the submission may fail due to timeout. Now, when your first get to the code area, there is some predefined code for you:

#include <iostream>
#include <deque> 
using namespace std;

void printKMax(int arr[], int n, int k){
    //Write your code here.
}

int main(){

    int t;
    cin >> t;
    while(t>0) {
        int n,k;
        cin >> n >> k;
        int i;
        int arr[n];
        for(i=0;i<n;i++)
            cin >> arr[i];
        printKMax(arr, n, k);
        t--;
    }
    return 0;
}

All fine and dandy, the site already handles input for your benefit, and points you towards a modular approach. All you have to do is actually solve the contiguous subarray problem; I did it with the following function, which passes all test cases:

void printKMax(int arr[], int n, int k){
    // Queue will store up to k elements (moving window)
    // Elements stored will be indices of the array. Indices allow to compare with loop iterator for removal of oldest element as window moves
    // Most recent elements are pushed to the back (so they take longer to be popped)
    std::deque<int> Q(k); 

    // Iterate over the first window
    int i; 
    for (i = 0; i < k; ++i) { 
        // A new element (arr[i]) to be added at the back invalidates all elements added before it that are no greater (if they are smaller, they cannot be a maximum; if they are equal, the new element is more recent). Those elements are popped
        // This guarantees that the values corresponding to Q's indices will be increasing (strictly) from back to front. In particular, the front will be the maximum
        while ((!Q.empty()) && arr[i] >= arr[Q.back()]) 
            Q.pop_back();
        Q.push_back(i); 
    }

    // Iterate, letting the window move
    for (; i < n; ++i) { 
        // Print the largest element (window has not moved yet), followed by a space
        cout << arr[Q.front()] << " "; 

        // Use indices to determine if oldest element has gone out of the window
        if (Q.front() <= i - k) Q.pop_front();

        // Add new element, like before
        while ((!Q.empty()) && arr[i] >= arr[Q.back()]) 
            Q.pop_back(); 
        Q.push_back(i); 
    } 

    // Print the maximum element of last window, and end the line 
    cout << arr[Q.front()] << endl; 
}

However, looking back on it, I thought I could do (slightly?) better. For each test case, the input handling in main loops n times to fill the array arr, and then printKMax in turn also loops n times to create and slide the moving window. I thought I could combine these into a single loop, which I guess should be more efficient. I did it with the following code, which is basically the same as before but with printKMax incorporated into the main. I'll omit comments this time.

int main(){

    int t;
    cin >> t;
    while(t>0) {
        int i = 0, n, k;
        cin >> n >> k;
        int arr[n];

        std::deque<int> Q(k);

        for (; i < k; ++i) {
            cin >> arr[i];
            while ((!Q.empty()) && arr[i] >= arr[Q.back()]) 
                Q.pop_back();
            Q.push_back(i); 
        }

        for (; i < n; ++i) {
            cout << arr[Q.front()] << " "; 

            if (Q.front() <= i - k) Q.pop_front();

            cin >> arr[i];
            while ((!Q.empty()) && arr[i] >= arr[Q.back()]) 
                Q.pop_back(); 
            Q.push_back(i); 
        } 

        cout << arr[Q.front()] << endl;

        t--;
        }
    return 0;
}

This time, however, the code fails all but the simplest of test cases, in each case due to time limit constraints.

I understand that in practice modularization is good, but at this point my interest is more 'academic' than anything else. Is there something I'm missing about the actual program, or is this something related to how things are run behind the scenes that I'm unaware of?

EDIT: As suggested by PaulMcKenzie, I've used g++ to run both versions of the program, using one of the failed test cases as input. Both ran fine and produced the same output, but the one with printKMax ran in 3.97s while the one without ran in 5.44s. It takes about 37% longe.

Off-topic … This challenge gives you code that has `int arr[n];`? That's not even valid C++. — ChrisMM, Feb 19 '20 at 15:00
**Profiling** and a **debugger** are your friend to find the desired answers. — MrSmith42, Feb 19 '20 at 15:02
@ChrisMM, it's probably one of the rare cases where it is actually fine, if not preferable, to do so - the authors run the servers, they control the environment the program will be compiled and run in, they generate the test cases, etc. They _know_ it _will_ work. And gives the least amount possible of overhead (sure, the differences would probably be small, but still). — DeducibleSteak, Feb 19 '20 at 15:03
@freakish: the code as written guarantees that `Q.back() < i`, so `arr[Q.back()]` was already initialized. The problem is likely somewhere else in OPs code. — Yakov Galka, Feb 19 '20 at 15:05
Notice that the code runs fine on the first test case and attains the expected output. In all other test cases, the program fails due to timeout, not due to some compile error, segmentation fault, incorrect output or something else (which is pointed out when that's the case). — Fimpellizzeri, Feb 19 '20 at 15:07
@DeducibleSteak, I realize that, but it also teaches bad habits to people who use these services to learn how to code (which they shouldn't be). Presumably they would know the maximum size of `n`, and could just hard code the array's size, and only partially fill it as needed. — ChrisMM, Feb 19 '20 at 15:08
@Fimpellizieri there's `cin >> arr[i];` in your last snippet in the last loop which is not present in `printKMax`. That would explain the timeout. — freakish, Feb 19 '20 at 15:10
@freakish Yes, the idea was precisely to combine reading the input and moving the window int the same loop (instead of having to loop twice). — Fimpellizzeri, Feb 19 '20 at 15:11
I know [this question](https://stackoverflow.com/q/11241523/6427763) is about Python, but behind the scenes, I guess this could be related? — Fimpellizzeri, Feb 19 '20 at 15:12
@freakish Notice that there are two loops: the first runs through k to define the initial sliding window, the second runs from k onwards to actually slide the window. In total, that's n iterations. — Fimpellizzeri, Feb 19 '20 at 15:13
@Fimpellizieri oh, right, got ya. The code is a bit complicated. — freakish, Feb 19 '20 at 15:14
Maybe the calling framework expects your program to first read all the input before it starts output. — Henrik, Feb 19 '20 at 15:15
Why do you need this `arr` at all if you have `Q` and you want process data as they are read? Also placing whole code in `main` doesn't improve performance, but makes code hard to read that is why you have problems. — Marek R, Feb 19 '20 at 15:15
@MarekR I actually need the array to access and print the maximum element (rather than just its index). Like I said in the post, I understand that modularization is good, but I would like to understand **why** combining two n-iteration loops into a single one ends up less efficient. — Fimpellizzeri, Feb 19 '20 at 15:19
@Fimpellizieri If this is HackerRank, isn't there an option to see the test case that fails? If so, then take that test case, rewrite your program so that it runs on a real compiler (not an online one), and see for yourself what is happening. You're asking us to look at your code, and also to figure out the man behind the HackerRank curtain. Eliminate the need for the latter, and just work with the former. — PaulMcKenzie, Feb 19 '20 at 15:46
@PaulMcKenzie As per your suggestion, I've used g++ to run both versions of the program, using one of the failed test cases as input. Both ran fine and produced the same output, but the one with `printKMax` ran in 0.037s while the one without ran in 1.912s. It takes over 50 times longer! I've also added this as an edit to the body of the question. — Fimpellizzeri, Feb 19 '20 at 17:11
Note that there are some nice solutions to this problem here: https://stackoverflow.com/questions/57212532/make-time-complexity-as-on-for-this-problem/57213522#57213522 — Matt Timmermans, Feb 19 '20 at 17:27
@MattTimmermans Yes, this is basically what's implemented here! It was a very interesting algorithm. — Fimpellizzeri, Feb 19 '20 at 17:29
@Fimpellizieri -- Did you run it with full optimizations turned on when you built the code? Did you use the `-O2` or `-O3` option in the build? — PaulMcKenzie, Feb 19 '20 at 18:21
@Fimpellizieri, if any of the suggested io tweaks improve your 1.912s result, mention it here, I'd be interested in how bad it can make things in this case. — DeducibleSteak, Feb 19 '20 at 19:32

Matt Timmermans · Accepted Answer · 2020-02-19T17:59:45.913

5

You second version is slower, because you're interleaving your input and output.

Reading from cin causes any output buffered for cout to be flushed first. The idea is that there is a user that needs to see your prompt.

Since you switch between reading and writing many times for each test, it results in many I/O operations to write to the console, which is not so fast.

See: https://www.geeksforgeeks.org/fast-io-for-competitive-programming/ and https://en.cppreference.com/w/cpp/io/basic_ios/tie

you can fix it with cin.tie(NULL);

edited Feb 19 '20 at 17:59

answered Feb 19 '20 at 17:44

Matt Timmermans

53,709
3
46
87

This seems to be the case. Using `cin.tie(NULL)` alone already causes the code to pass the tests. Thank you for your answer and references, I didn't know I/O had this large of an impact. – Fimpellizzeri Feb 19 '20 at 21:23

Maxim Egorushkin · Answer 2 · 2020-02-19T20:53:08.817

3

One reason the "de-modularized" version is slow is because you have more I/O in the loop. From the C++ standard:

Reading an object designated by a volatile glvalue (6.10), modifying an object, calling a library I/O function, or calling a function that does any of those operations are all side effects, which are changes in the state of the execution environment.

After calling I/O function the code must re-load objects from memory (unless the object didn't have to be stored in memory at all, like local variables whose address hasn't leaked). Note, that any function whose definition isn't available at compile time is considered to call I/O functions.

To improve the de-modularized version performance, do not do any I/O in it. Rather store the results into another array int results[k]; and print that array later.

Another issue is that by default std::cin and std::cout are in sync_with_stdio:

In practice, this means that the synchronized C++ streams are unbuffered, and each I/O operation on a C++ stream is immediately applied to the corresponding C stream's buffer.

Try calling std::cin.sync_with_stdio(false); and std::cout.sync_with_stdio(false); - that makes C++ I/O much faster.

edited Feb 19 '20 at 20:53

answered Feb 19 '20 at 17:28

Maxim Egorushkin

131,725
17
180
271

Did you mean "the code must re-load all _non_-local variables from memory"? I'm pretty sure functions can keep their local variables in registers during a function call, if they are not exposed to the outside world in any way, no? – DeducibleSteak Feb 19 '20 at 17:36
@DeducibleSteak Nope, I meant _local_ variables. An I/O function can change the stack even if it doesn't take any pointers to stack variables. – Maxim Egorushkin Feb 19 '20 at 17:42
Have a look here: https://godbolt.org/z/nxNT5T , gcc 9.2 seems to disagree. It hard-codes the return value that was stored in a local int during a call to cin::operator>>() (source #1), and reads it from memory when the return value is held in an int in an unknown location through a reference, during the same call (source #2). – DeducibleSteak Feb 19 '20 at 17:57
@DeducibleSteak Your example demonstrates that it reloads the local variables from memory after an I/O call, see `mov eax, 123` and `mov eax, DWORD PTR [rbx]`. – Maxim Egorushkin Feb 19 '20 at 18:00
`mov eax, 123` simply puts the hardcoded value 123 into eax, as I stated. `mov eax, DWORD PTR [rbx]` read the value from memory, as I also stated. – DeducibleSteak Feb 19 '20 at 18:02
@DeducibleSteak It could do `mov eax, 123` instead of `mov eax, DWORD PTR [rbx]` but it cannot after I/O. The C++ standard says that I/O functions change execution environment. The stack is a part of that environment. – Maxim Egorushkin Feb 19 '20 at 18:04
Is the line just before `mov eax, 123`, the part that goes: `call std::basic_istream >::operator>>(int&)` not I/O? – DeducibleSteak Feb 19 '20 at 18:06
@DeducibleSteak Writing `int result = 123;` doesn't require or guarantee that `result` is stored on the stack. Here is another example for you: https://godbolt.org/z/ogMV4- – Maxim Egorushkin Feb 19 '20 at 18:09
So, when you said, about `int result = 123;` that it "It could do `mov eax, 123` instead of `mov eax, DWORD PTR [rbx]` but it cannot", you meant: "It could do `mov eax, 123`, and it does!"? But OK, I've had enough fun. – DeducibleSteak Feb 19 '20 at 18:32
Looky here, I've even managed to get `result` to have an address (`[rsp+12]`), call std::cout with a value read from that address, and still, when `result` gets used for the return value, the value is hard coded using `mov eax, 123`: https://godbolt.org/z/eGt9B6 – DeducibleSteak Feb 19 '20 at 19:22
@DeducibleSteak The generated assembly calls `call std::basic_ostream >::operator<<(int)` which takes `result` by value, so `result` doesn't need to be stored in memory at that point. Surprising that the entire array initialization isn't optimized out. – Maxim Egorushkin Feb 19 '20 at 20:50
1

Well, but it is - like I said, at address `[rsp+12]`. I see you've also changed your response - to correctly indicate which local variables need to be reloaded, and which do not. Good! Upvote from me! – DeducibleSteak Feb 19 '20 at 20:56

Why is this C++ deque code 'less efficient' when I demodularize it?

2 Answers2