(C++) Leetcode: why is my code so much slower than sample top solution? (547. Number of Provinces)

Question

https://leetcode.com/problems/number-of-provinces/

I was pretty excited when I solved this problem on my very first try, within only 20/30 minutes, though when I submitted my code, I ended up in 8.43 percentile. I looked at how the fastest solutions approached the problem, and low and behold, the sample top solution is nearly identical to my code, yet it runs 3x faster. I've been comparing the code and can't really point out a substantial enough difference. Both should be equally fast... Can anyone explain the why? If I'm not mistaken it's O(mn) performance in both cases.

The following is my code. It's pretty self-explanatory, so not sure heavy commenting would do any good.

class Solution {
public:
    int findCircleNum(vector<vector<int>>& isConnected) {
        int components = 0;
        vector<bool> visited (isConnected.size(), false);
        
        // go through each row
        for (int i = 0; i < isConnected.size(); i++) {
            // explore only unvisited items
            if (!visited[i]) {
                queue<int> q;
                
                q.push(i);
                components++;
            
                while (!q.empty()) {
                    int node = q.front();
                    q.pop();
                    visited[node] = true;
                    
                    // push all direct connections onto the queue so we explore them
                    for (int j = 0; j < isConnected[0].size(); j++) {
                        if (isConnected[node][j] == 1 && !visited[j]) {
                            q.push(j);
                        }
                    }
                }
            }
        }
        
        return components;
    }
};

and the following is a sample top solution that runs 3x faster than my code.

class Solution {
public:
    int findCircleNum(vector<vector<int>>& M) {
        if (M.empty()) {
            return 0;
        }
        int count = 0;
        vector<bool> visited(M.size());
        auto bfs = [&](int student) {
            queue<int> q;
            q.push(student);
            visited[student] = true;
                    
            while (!q.empty()) {
                auto current = q.front();
                cout << "current " << current << endl;
                q.pop();
                        
                for (int i = 0; i < M.size(); i++) {
                    if (M[current][i] == 1 and !visited[i]) {
                        visited[i] = true;
                        q.push(i);
                    }
                }
            }
        };
        for (int r = 0; r < M.size(); r++) {
                if (visited[r] == false) {
                    count++;
                    bfs(r);
                }
        }
        return count;
    }
};

Seems like even the allegedly "top" solution is blissfully unaware of what `std::vector` really is, so in terms of performance even the allegedly "top" solution must still be much slower than it has to be. This only goes to show how useless are all these leetcode-site sites, in terms of improving one's C++ knowledge and understanding, and how much of a waste of time they are. — Sam Varshavchik, Jan 01 '21 at 23:16
I mean exactly what I mean: sites like leetcode are just a meaningless list of programming puzzles, and there's nothing there that attempts to teach or explain the underlying principles needed to solve those puzzles. So, if one hopes to actually learn something useful, regarding C++, or some other programming language, they'll be sorely disappointed. Only a good C++ textbook will explain why an unfortunate historical accident made `std::vector` what it is (hint: it's not an array of `bool`s), and why it should never be used in actual code. — Sam Varshavchik, Jan 01 '21 at 23:24
@Yolomep: My guess would be that he's heard rumors about `std::vector` being slow, but doesn't really know enough about it to comment meaningfully. It (usually) packs each bool into a single bit, so reading or writing a value requires some bit masking and such. What he's missed is that this improves cache usage (by a factor of about 8 in a typical case) so if the array is at all large, the improved cache usage matters more than the bit-masking, so it wins (usually by a pretty wide margin). For those who actually care: https://stackoverflow.com/a/16738887/179910 — Jerry Coffin, Jan 01 '21 at 23:25
The "top solution" is using a lambda for no apparent reason. It appears to be functionally equivalent to just pasting the code into the loop. Leetcode is a waste of your time. We specifically avoid asking questions like those you'll find there. We ask real engineering questions, typically related to some problem we're facing, not something that one can google. (If that worked, we'd just google the damn thing ourselves and save a couple of hundred grand.) — 3Dave, Jan 01 '21 at 23:27
I am just spitballing here, but the usage of the constructor vector(size_t, bool) is iterating through the vector once to set the values to false. the other contructor doesn't, which could be the difference — Midnight Exigent, Jan 01 '21 at 23:30
@jbztt Don't guess. Look at the spec or a common implementation. Assumptions cause startups to crater. — 3Dave, Jan 01 '21 at 23:30
@3Dave Out of 3 job application responses I received so far, every single one of them wanted me to solve a variation of these problems. I get what you mean, but unfortunately that's not the reality; though it does depend on which jobs are you applying to. — haosmark, Jan 01 '21 at 23:32
@jbztt it's an O(N) operation, which doesn't matter that much here. — haosmark, Jan 01 '21 at 23:34
@haosmark If you want to be a cog, go get one of those jobs. If you want to be an engineer, learn how these things work. Personally: if I was asked one of these questions, I'd end the interview and thank them for their time. It says a lot about the caliber and quality of the people you'd be working with that *this* is what they've come up with. — 3Dave, Jan 01 '21 at 23:34
This thing is a basic tree search. It's not that complicated. The "top" solution uses a queue instead of explicit code recursion, but there's nothing new here. This is easy enough to research and explore on your own using ANY search engine and a debugger. This question doesn't belong here. — 3Dave, Jan 01 '21 at 23:35
@3Dave hmm, I agree with you that those questions are ridiculous but unfortunately it is the current mindset of the biggest tech companies including google and such. so until that changes, people have to go learn basic algorithms if they want to have a chance to pass the interview — Midnight Exigent, Jan 01 '21 at 23:37
@jbztt I work for one of the biggest tech companies. I do NOT ask these things in an interview. — 3Dave, Jan 01 '21 at 23:38
@haosmark: I don't remember if it was on leetcode, but I wrote some code on one of those sites to answer a question a bit like this some years ago. My guess is they get a lot that run at nearly the same speed, so even trivial changes in speed can make a large difference in placement. I saw pretty serious changes in placement (like 40 or 50 percentile points) simply by re-running the exact same code a few times. — Jerry Coffin, Jan 02 '21 at 00:26

score 0 · Answer 1 · answered Jan 02 '21 at 02:48

The difference is as far as I can [see][1] the placement of visited[i] = true;, which causes a few less memory access per iteration. Where the OP code needs to re-fetch the bool.

And there might be a data or control flow dependence between

visited[node] = true;

and

!visited[j]

That is not there in the Best code.

OP code inner loop

.L118:
        mov     rax, QWORD PTR [rsi+rbx]
        cmp     DWORD PTR [rax+rcx*4], 1
        jne     .L116

        mov     rax, rbp
        mov     r8, rcx
        sal     rax, cl
        mov     rcx, QWORD PTR [rsp+80]
        shr     r8, 6
        and     rax, QWORD PTR [rcx+r8*8]
        jne     .L116
        mov     rax, QWORD PTR [rsp+192]
        sub     rax, 4
        cmp     rdi, rax
        je      .L117

"Best" code

.L76:
        mov     rax, QWORD PTR [rsi+rbx]
        cmp     DWORD PTR [rax+rcx*4], 1
        jne     .L74

        mov     rax, QWORD PTR [r12]
        mov     rsi, rcx
        shr     rsi, 6
        mov     rax, QWORD PTR [rax]
        lea     rsi, [rax+rsi*8]
        mov     eax, 1
        sal     rax, cl
        mov     rcx, QWORD PTR [rsi]
        test    rcx, rax
        jne     .L74
        or      rax, rcx <------------ visited[i] = true;
        mov     QWORD PTR [rsi], rax
        mov     rax, QWORD PTR [rsp+96]
        sub     rax, 4
        cmp     r8, rax
        je      .L75
        mov     DWORD PTR [r8], edx
        add     r8, 4
        mov     QWORD PTR [rsp+80], r8
        jmp     .L74


  [1]: https://godbolt.org/z/obfqf7

(C++) Leetcode: why is my code so much slower than sample top solution? (547. Number of Provinces)

1 Answers1