6

Question

I came across a leetcode question gas-station

However I found my code is slightly faster if I use if/else rather than if/continue.
Edit: after twiddling with the test case. I found the problem is continue
It seems that clang is taking continue seriously..? My guess is hat clang somehow tries to put the part before continue and loop checking together.

In short: I want to know:

  • How do I "persuade" clang to understand my implementation about continue like if/else? Because I don't want to change my style of coding.

Edited sidenote: branch-prediction tag is removed.

Relative question

Should I use return/continue statement instead of if-else? which seems that they should have nearly no difference.

Investigation

Edit: the demo shows that the problem is continue

I used a 10k data set to test the code and found that the problem is continue itself.
I put continue in the if's end of body and then the runtime is increased by 30% ~ 50% which is still surprising to me.
Live Demo
Difference in number: 11070 to 15060(w/ continue)

Edited on 2022/5/30: Sorry I found that my previous naive profiling mistakenly got optimized out of part of calculation... which causes the difference in order of 100... Bad demo

Profiling code

#include <iostream>
#include <vector>
#include <chrono>
#include <cstdlib>

using namespace std; // Yeah I know it's bad.
// input , output
class Solution {
public:
    int canCompleteCircuit(vector<int>& gas, vector<int>& cost) {
        if(gas.empty()){
            return -1;
        }
        int len = gas.size();
        // start at the last station, move start backward if we cannot finished the loop
        int start = len - 1;
        int tank = gas[start] - cost[start];
        int cur = 0; // from first station
        while(start != cur){ // other condition?
            if(tank < 0){
                // if not, move start backward, check the value
                // continue...start > 0;
                start--;
                tank += gas[start] - cost[start];
                // continue;
            }
            else{
                tank += gas[cur] - cost[cur];
                cur++;   
            }
        }
        //cout << tank ;
        if(tank < 0){
            return -1;    
        }
        return start;
    }
};
volatile int tryNoToBeOptimizedOut = 0;    
int main() {

    Solution s;
    vector<int> gas{...}; // 10k data here
    vector<int> cost{...}; 
    int count = 100; 
    vector<int> results;
    for(int i = 0; i <count; i++){
        // size_t rIndex = rand()%gas.size();
        // gas[rIndex] += i;
        gas[i] += i;
        auto start = std::chrono::system_clock::now();
        int ret = s.canCompleteCircuit(gas,cost);
        auto end = std::chrono::system_clock::now();
        auto elapsed = end - start;
        std::cout << elapsed.count() << '\n';
        tryNoToBeOptimizedOut = ret;
        results.push_back(ret);
    }   
    tryNoToBeOptimizedOut = results[rand()%results.size()];
   
    return 0;
}

old tests before edit

In the first place I checked the implementation in gcc and found that both versions are nearly identical in assembly.
Live Demo in gcc

I know that Leetcode uses clang 11 with O2. So I checked the difference in clang 11.
Live Demo in clang 11
Live Demo in clang 14 doesn't get better

The results are a bit different between them. It seems that cur++; is combined into while predicates in if/continue version.
I'm not sure why....Is it related to optimization policies?

My code:

    int canCompleteCircuit(vector<int>& gas, vector<int>& cost) {
        if(gas.empty()){
            return -1;
        }
        int len = gas.size();
        // start at the last station, move start backward if we cannot finished the loop
        int start = len - 1;
        int tank = gas[start] - cost[start];
        int cur = 0; // from first station
        while(start != cur){ // other condition?
            if(tank < 0){
               // if not, move start backward, check the value
                // continue...start > 0;
                start--;
                tank += gas[start] - cost[start];
                continue;
            }
            // else{
                tank += gas[cur] - cost[cur];
                cur++;   
                // continue; <-- If I put another continue there then the runtime is similar to if/else version. But is there a better way than adding conitnue manually?
            // }
        }
        if(tank < 0){
            return -1;    
        }
        return start;
    }

Assembly in if/else version

.LBB0_5:                                #   in Loop: Header=BB0_3 Depth=1
        movsxd  rcx, ecx                #   `else` part
        mov     edx, dword ptr [r14 + 4*rcx]
        sub     edx, dword ptr [rbx + 4*rcx]
        add     ecx, 1
.LBB0_6:                                #   in Loop: Header=BB0_3 Depth=1
        add     eax, edx
        mov     edx, eax
        shr     edx, 31
        cmp     esi, ecx                # `while(start != cur){`
        je      .LBB0_7
.LBB0_3:                                # =>This Inner Loop Header: Depth=1
        test    dl, 1                   # `if(tank < 0){`
        je      .LBB0_5                 # go to else
        movsxd  rdi, esi
        add     esi, -1
        mov     edx, dword ptr [r14 + 4*rdi - 4]
        sub     edx, dword ptr [rbx + 4*rdi - 4]
        jmp     .LBB0_6

Assembly in if/continue version

.LBB0_3:                                # =>This Loop Header: Depth=1
        mov     r9, rdx
        mov     edx, r10d
        sub     edx, r9d
        add     rdx, 4
        movsxd  rdi, r9d
        lea     rsi, [r12 + 4*rdi]
        lea     rbx, [r14 + 4*rdi]
        xor     edi, edi
.LBB0_4:                                #   Parent Loop BB0_3 Depth=1
        test    cl, 1                   # `if` part
        jne     .LBB0_5
        add     eax, dword ptr [rbx + 4*rdi]  # after `if` statement
        sub     eax, dword ptr [rsi + 4*rdi]
        mov     ecx, eax
        shr     ecx, 31
        add     rdi, 1
        cmp     edx, edi
        jne     .LBB0_4
        jmp     .LBB0_7
.LBB0_5:                                #   in Loop: Header=BB0_3 Depth=1
        add     eax, dword ptr [r14 + 4*r8 - 4]    # Body of `if`
        sub     eax, dword ptr [r12 + 4*r8 - 4]
        add     r8, -1
        mov     esi, r10d
        sub     esi, r9d
        add     esi, 3
        mov     ecx, eax
        shr     ecx, 31
        movsxd  rdx, r9d
        add     rdx, rdi
        add     r10d, -1
        cmp     esi, edi
        jne     .LBB0_3
Louis Go
  • 2,213
  • 2
  • 16
  • 29
  • 1
    Have you run this in a profiler and checked the branch prediction stats? – Goswin von Brederlow May 29 '22 at 03:00
  • @GoswinvonBrederlow Sorry, I didn't do that. Also I didn't get all the test cases on leetcode. Do you mean that my assembly result depends on the input? – Louis Go May 29 '22 at 03:02
  • clang is an ahead-of-time C++ compiler. The asm depends only on the source code, *unless* you enable profile-guided optimization. In that case then yes, the compiler will know which branches are likely/unlikely and optimize accordingly, and that can depend on the input. Otherwise, the compiler wasn't involved in any step after the first run of the program. – Peter Cordes May 29 '22 at 03:24
  • 2
    It does sometimes happen that different ways of writing equivalent logic end up leading the compiler to make different asm. Either because of missed optimizations, or because of different heuristics for it guessing which path of execution would be more likely. (Perhaps leading to some other missed opt.) If something is performance critical, and you're sure you know which way a branch will usually go, `__builtin_expect` can sometimes help: [How do the likely/unlikely macros in the Linux kernel work and what is their benefit?](https://stackoverflow.com/a/31133787) – Peter Cordes May 29 '22 at 03:25
  • 3
    This doesn't look like anything inherently linked to if/else vs if/continue to me, but just that re-rolling the compiler dice worked out poorly for the second case and Clang randomly got more confused and decides to recompute `start` from scratch (most of the stuff in `LBB0_5` of the second snippet does that) every time it decrements it. – harold May 29 '22 at 03:33
  • @harold Thanks to your hint. I found that If I put another `continue` in the end of while loop. Then the runtime would be similar to `if/else` one. But I still think it's related to `if/continue` since this kind of scenario occurs most of time when I use `if/continue`. I have found this phenomenon on multiple leetcode question instead of just this one. – Louis Go May 29 '22 at 03:40
  • You tagged it branch-prediction. So you better measure the prediction. – Goswin von Brederlow May 29 '22 at 04:10
  • 1
    Never ever trust the timing output from leetcode. (in fact it is a very bad resource for learning how to write good and/or optimized c++ code). The system is shared with a lot of users so the deviation from average is pretty big. To optimize C++ use a local compiler and enable profiling. For if statements make them predicatable that will get you the best speed optimization. – Pepijn Kramer May 29 '22 at 06:30
  • 3
    This doesn't answer the question, but you can make this branchless by using `auto q = tank < 0 ? start-- : cur++; tank += gas[q] - cost[q];` – Aki Suihkonen May 29 '22 at 08:06
  • 2
    @AkiSuihkonen: Interesting idea, yeah that does compile branchlessly using `sar reg,31` to materialize a 0 or `-1`. https://godbolt.org/z/frW5Mbha6 (right-hand pane). Pretty sure it's doing something silly like `(~tank)>>31` instead of materializing the opposite value with a `not` on the shift result, though. Anyway, probably a lot faster if frequent backtracking is required so it doesn't predict well, probably slower for inputs where it can just blaze through the `tank >= 0` cases without switching to the other loop or branch. – Peter Cordes May 29 '22 at 08:56
  • @AkiSuihkonen: I was hoping clang for AArch64 could use `csinc`, but no, `add` with `lsr` and `add` with `asr` to add 1 or add -1 respectively, so I guess that's at least as good. https://godbolt.org/z/dq9MrneYz – Peter Cordes May 29 '22 at 09:05
  • 2
    One can squeeze out a few instructions also with `auto q = tank < 0 ? 0 : start; --start; gas += tank >= 0; cost += tank >=0;` as in https://godbolt.org/z/8vdbvWEYe. Also one could subtract the vectors in advance ( or interleave them). – Aki Suihkonen May 29 '22 at 13:18

1 Answers1

1

It is not, in short, continue is not ideal in your code but it is not slower or faster than if and i don't think i can give you a concrete answer but i think it's some kind of ratio of branch size over the chance of same condition. Running this example the results i get are wild, from run to run it can change a massive amount; in the example i tried to mix up a bit the main loop in a way that the compiler wont try to compress everything inside main but it basically consists of your solution and mine with if/continue and with two loop test.

I added the <type>_loop() to compare how the loop would run without the condition and implementing something like a duff or a loop unroling (to me they are the same thing but writen in a slightly different way); somehow the "faster" version runs slower and i only suspect is a bandwidth problem just from the time waited

#include <iostream>
#include <iomanip>
#include <vector>
#include <chrono>
#include <cmath>

#include <sample.hpp>

int solution_a_if(std::vector<int>& gas, std::vector<int>& cost) {
    int start = gas.size() - 1;
    int tank = gas[start] - cost[start];
    int cur = 0;
    while(start != cur){ 
        if(tank < 0){
            start--;
            tank += gas[start] - cost[start];
        }
        else{
            tank += gas[cur] - cost[cur];
            cur++;   
        }
    }
    if(tank < 0){
        return -1;    
    }
    return start;
}
int solution_a_continue(std::vector<int>& gas, std::vector<int>& cost) {
    int start = gas.size() - 1;
    int tank = gas[start] - cost[start];
    int cur = 0;
    while(start != cur){ 
        if(tank < 0){
            start--;
            tank += gas[start] - cost[start];
            continue;
        }
        tank += gas[cur] - cost[cur];
        cur++;   
    }
    if(tank < 0){
        return -1;    
    }
    return start;
}

int solution_b_if(std::vector<int>& gas, std::vector<int>& cost) {
    const int * p_cost = cost.data();
    const int * p_gas = gas.data();
    const int size = (int)gas.size();

    int start = -1, step = 0, rem = 0, n = 0;

    for (; n < size; n++){
        step += p_gas[n] - p_cost[n];
        if(step < 0){
            rem  += step; 
            step ^= step;
            start = n;
        }
    }
    if(rem + step < 0){
        return -1;
    }
    return ++start;
}
int solution_b_continue(std::vector<int>& gas, std::vector<int>& cost) {
    const int * p_cost = cost.data();
    const int * p_gas = gas.data();
    const int size = (int)gas.size();

    int start = -1, step = 0, rem = 0, n = 0;

    for (; n < size; n++){
        step += p_gas[n] - p_cost[n];
        if(step >= 0){ 
            continue; 
        }
        rem  += step; 
        step ^= step;
        start = n;
    }
    if(rem + step < 0){
        return -1;
    }
    return ++start;
}

int thin_loop(std::vector<int>& gas, std::vector<int>& cost) {
    const int size = (int)gas.size();
    int sum = 0;
    for (int n = 0; n < size; n++){
        sum += gas[n] - cost[n];
    }
    return sum;
}
int thick_loop(std::vector<int>& gas, std::vector<int>& cost) {
    const int size = (int)gas.size();
    int sum = 0, n = 0;
    switch (size % 4){
        case 3: sum += gas[n] - cost[n]; n++;
        case 2: sum += gas[n] - cost[n]; n++;
        case 1: sum += gas[n] - cost[n]; n++;
    }
    for (; n < size; n += 4){
        sum += gas[n + 0] - cost[n + 0];
        sum += gas[n + 1] - cost[n + 1];
        sum += gas[n + 2] - cost[n + 2];
        sum += gas[n + 3] - cost[n + 3];
    }
    return sum;
}

std::chrono::steady_clock::time_point tmps;
std::chrono::duration<double> duration;
double time_foo(int ((*foo)(std::vector<int>&,std::vector<int>&))){
    tmps = std::chrono::steady_clock::now();
    {
        foo(gas, cost);
    }
    duration = std::chrono::steady_clock::now() - tmps;
    return duration.count();
}

void print_info(const char * str, double avr, double dev){
    int mili = 1000;  // nano -> micro
    std::cout << std::left << std::setw(15) << str << "->  (avr) " << 
    std::fixed << std::setprecision(8) << avr * mili << "  (dev) " << 
    std::fixed << std::setprecision(8) << dev * mili << std::endl;

}

int main(){
    int i, loops = 10;
    double * times = new double[loops * 6];

    for (i = 0; i < loops; i += 6){
        times[i + 0] = time_foo(solution_a_if);
        times[i + 1] = time_foo(solution_a_continue);
        times[i + 2] = time_foo(solution_b_if);
        times[i + 3] = time_foo(solution_b_continue);
        times[i + 4] = time_foo(thin_loop);
        times[i + 5] = time_foo(thick_loop);

        std::cout << "\r" << "Loop i -> " << i;
    }

    double avr [6] = { 0, 0, 0, 0, 0, 0 };
    double dev [6] = { 0, 0, 0, 0, 0, 0 };

    for (i = 0; i < loops; i += 6){
        avr[0] += times[i + 0];
        avr[1] += times[i + 1];
        avr[2] += times[i + 2];
        avr[3] += times[i + 3];
        avr[4] += times[i + 4];
        avr[5] += times[i + 5];
    }
    avr[0] /= loops; avr[1] /= loops;
    avr[2] /= loops; avr[3] /= loops;
    avr[4] /= loops; avr[5] /= loops;

    for (i = 0; i < loops; i += 6){
        dev[0] += std::pow(times[i + 0] - avr[0], 2);
        dev[1] += std::pow(times[i + 1] - avr[1], 2);
        dev[2] += std::pow(times[i + 2] - avr[2], 2);
        dev[3] += std::pow(times[i + 3] - avr[3], 2);
        dev[4] += std::pow(times[i + 4] - avr[4], 2);
        dev[5] += std::pow(times[i + 5] - avr[5], 2);
    }
    dev[0] = std::sqrt(dev[0] / (loops - 1));
    dev[1] = std::sqrt(dev[1] / (loops - 1));
    dev[2] = std::sqrt(dev[2] / (loops - 1));
    dev[3] = std::sqrt(dev[3] / (loops - 1));
    dev[4] = std::sqrt(dev[4] / (loops - 1));
    dev[5] = std::sqrt(dev[5] / (loops - 1));

    std::cout << std::endl << "Scale microseconds" << std::endl; 
    print_info("A (if)",       avr[0], dev[0]);
    print_info("A (continue)", avr[1], dev[1]);
    print_info("B (if)",       avr[2], dev[2]);
    print_info("B (continue)", avr[3], dev[3]);
    print_info("Thin loop",    avr[4], dev[4]);
    print_info("Thick loop",   avr[5], dev[5]);

    delete [] times;
    return 0;
}

I used the AMD profiler, this is the first time i used one of these and a least with this one is realy nice to use

Loop i -> 999996
Scale microseconds
A (if)         ->  (avr) 0.00141345  (dev) 0.00629380
A (continue)   ->  (avr) 0.00147324  (dev) 0.00349820
B (if)         ->  (avr) 0.00127486  (dev) 0.00452548
B (continue)   ->  (avr) 0.00102505  (dev) 0.00627695
Thin loop      ->  (avr) 0.00018223  (dev) 0.00130967
Thick loop     ->  (avr) 0.00043947  (dev) 0.00244409

run of 1m

Loop i -> 996
Scale microseconds
A (if)         ->  (avr) 0.00148050  (dev) 0.00339158
A (continue)   ->  (avr) 0.00152250  (dev) 0.00325213
B (if)         ->  (avr) 0.00138230  (dev) 0.00322943
B (continue)   ->  (avr) 0.00114570  (dev) 0.00315594
Thin loop      ->  (avr) 0.00019380  (dev) 0.00043278
Thick loop     ->  (avr) 0.00046620  (dev) 0.00100103

run of 1k

Small rant: The leetcode testing system is so bad that makes overwatch look balanced, just by rerunning my code the timing changed drastically (min->71ms, max->155ms; with 26 runs) and the memory usage is just a lie, this example runs with ~10mb and it is telling me that just the function is consuming close to 70mb

<sample.hpp> is defines the gas/cost arrays from that long as sample you gave

SrPanda
  • 854
  • 1
  • 5
  • 9
  • BTW, most people abbreviate average to "avg", not "avr". My first thought on skimming the answer was to wonder if you'd timed it on an AVR microcontroller! Then I looked at the code and saw you were using that as a variable name. – Peter Cordes Jun 05 '22 at 23:07
  • That is a coincidence, i just prefer even size chars for short length names. I would say its because english is not my first and the names look cleaner – SrPanda Jun 06 '22 at 00:30
  • Thanks a lot. I think I need to re-consider the test cases.. Maybe it's not enough for testing. Previously I thought the upperbound and lowerbound of runtime could represent the trend of runtime. – Louis Go Jun 06 '22 at 02:25