0

I did this expriment https://en.algorithmica.org/hpc/pipelining/branching/

which is inspired by the most upvoted Stack Overflow question ever.

Why is processing a sorted array faster than processing an unsorted array?

however i got a symmetrical result on Apple M1 chip:

(while the experiment use AMD Zen2)

enter image description here

the code is

#include <bits/stdc++.h>

#ifndef N
#define N 1'000'000
#endif

#ifndef T
#define T 1e8
#endif

#ifndef P
#define P 50
#endif

const int K = T / N;
int a[N];

int main()
{
    for (int i = 0; i < N; i++)
        a[i] = rand() % 100;

#ifdef SORT
    std::sort(a, a + N);
#endif

    clock_t start = clock();
    volatile int s = 0;

    for (int k = 0; k < K; k++)
        for (int i = 0; i < N; i++)
#ifdef CMOV
            s += (a[i] < P ? a[i] : 0);
#else
            // if (__builtin_expect(a[i] < P, false))// [[unlikely]]
            if (a[i] < P)
                s += a[i];
#endif

    float seconds = float(clock() - start) / CLOCKS_PER_SEC;
    float per_element = 1e9 * seconds / K / N;

    printf("%.4f %.4f %.4f\n", seconds, per_element, 2 * per_element);
    printf("%d\n", s);

    return 0;
}

and the plot code with ipython:

import pickle
import matplotlib.pyplot as plt
import seaborn as sns

sns.reset_defaults()
sns.set_theme(style='whitegrid')

# benchmark
def bench(n=10**6, p=50, t=10**8, sort=False, cmov=False, unroll=False, cc='clang++'):
    res = !{cc} -std=c++17 -O3 -D N={n} -D T={t} -D P={p} {"-D CMOV" if cmov else ""} {"-D SORT" if sort else ""} {"-funroll-loops" if unroll else ""} branching.cc -o run && ./run
    print(res)
    return float(res[0].split()[-1])

ps = list(range(0, 101))
rs = [bench(p=p) for p in ps]

# plot
with open('ps.pkl', 'wb') as file:
    pickle.dump(rs, file)

with open('ps.pkl', 'rb') as file:
    rs = pickle.load(file)

plt.plot(ps, rs, c='darkred')

plt.xlabel('Branch probability (P)')
plt.ylabel('Cycles per iteration')

plt.title('for (int i = 0; i < N; i++) if (a[i] < P) s += a[i]', pad=12)

plt.ylim(bottom=0)
plt.margins(0)

fig = plt.gcf()
fig.savefig('branchy-vs-branchless.svg')
plt.show()

if the condition is true ,there would be more instructions for add to s so there should have a asymmetrical result.

why?

i tried to change !{cc} -std=c++17 -O3 to -O2 -O1 and -O0 but the result is still symmetrical

Add branchless experiment results:

enter image description here

  • check assembly. The cost of branch-miss is usually far larger then the cost of execution of few instruction produced from `s += a[i];`. It will likely be a single assembly instruction. Moreover, ARM cpus supports "conditional" instruction that are executed only if a specific bit in a status register is set/unset. – tstanisl May 12 '23 at 09:51
  • from what you wrote here it is not clear why you expect non-symmetric graph. You just say "if the condition .... so there should have a symmetrical result" you do get a symmetrical result. – 463035818_is_not_an_ai May 12 '23 at 09:53
  • 1
    Offtopic: In C++ do not use `#define` to handle constants. Use `const int ...` or better `constexpr int ....`. – Marek R May 12 '23 at 10:06
  • Do also experiment with: `s += (a[i] < P) * a[i];` (branchless). – Marek R May 12 '23 at 10:10
  • Here `#define` is used instead of `constexpr int ....` is because that it's convenient to pass the value and count time cost of different `p` – Bowen Smith May 12 '23 at 10:29
  • Also I did branchless experiment with `s += (a[i] < P) * a[i];` it has a const result around 1 with `-O3` flag on, added result pics above. – Bowen Smith May 12 '23 at 10:37
  • the instructions to add two integers are marginal compared to the cost of a branch prediction miss. Thats what the stackoverflow question and the article you link is all about – 463035818_is_not_an_ai May 12 '23 at 11:07
  • Why do you expect an asymmetric graph? At least in theory the branch predictor should be able to reduce the costs for the 100% and 0% hit case to the same values. The only difference is, if the memory fetch (a[i]) can be fast enough speculatively executed. It might be also cached in a register but that would require you to check the generated assembly code. – Alois Kraus May 18 '23 at 20:44

0 Answers0