5

I'm interested in calculating the triangle sequence1, which is the sequence of pairs (i, j): (0, 0), (1, 0), (1, 1), (2, 0), (2, 1) ... which iterates through all pairs (i, j) subject to the restriction i >= j. The same sequence but with the restriction i > j is also interesting.

These sequences represent, among other things, all the ways to choose 2 (possibly identical) elements from an n-element set (for the sequence up to (n, n)2), or the indices of the lower triangular elements of a matrix3. The sequence of values for i alone is A003056 in OEIS, while j alone is A002262. The sequence frequently arises in combinatorial algorithms, where its performance may be critical.

A simple but branchy way to generate the next value in the sequence is:

if (i == j) {
    j = 0;
    i++;
} else {
    j++;
}

However, this suffers from many mispredicts while calculating the initial elements of the sequence, when checking the condition (i == j) - generally one mispredict each time i is incremented. As the sequence progresses, mispredicts become rarer since i is incremented less frequently, so the j++ branch dominates and is well predicted. Still, some types of combinatorial search repeatedly iterate over the small terms of the sequence, so I'm looking for a branch-free approach, or some other approach that suffers fewer mispredicts.

For many uses, the order of the sequence isn't important, so generating the values in a different order than above is allowable if it leads to a better solution. For example, j could count down rather than up: (0, 0), (1, 1), (1, 0), (2, 2), (2, 1), ....
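For concreteness, the count-down ordering can be sketched as a helper (still branchy - this only pins down the ordering; the function name is mine, not part of any proposed solution):

```c
#include <assert.h>

/* Count-down ordering: for each i, j runs i, i-1, ..., 0.
   Starting from (0, 0) this yields (0, 0), (1, 1), (1, 0), (2, 2), ... */
static void next_pair_countdown(unsigned *i, unsigned *j) {
    if (*j == 0) {      /* row exhausted: advance i, restart j at the new i */
        ++*i;
        *j = *i;
    } else {
        --*j;
    }
}
```

Note that the (i == j) test becomes a (j == 0) test, but the branch pattern (and hence the mispredict behavior) is the same as in the count-up version.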


1 I'm also interested in knowing what the right name for this sequence is (perhaps so I can give this question a better title). I just kind of made up "triangle sequence".

2 Here, the i >= j version represents sub-multisets (repetition allowed), while the i > j variant represents normal subsets (no repetition).

3 Here, the i >= j version includes the main diagonal, while the i > j variant excludes it.

Guy Coder
BeeOnRope
  • Why don't you just unroll the small-n cases? And if you're doing more than a few instructions of application code with each pair, won't that be dominant? – Mike Dunlavey Jul 27 '16 at 15:16
  • Yes, on the "other work could dominate" point - but in at least one key scenario, the amount of work is only a few instructions so the triangle overhead is a big part. – BeeOnRope Jul 28 '16 at 01:34
  • The problem with unrolling is that the actual code isn't in a loop (i.e., using external iteration) but is instead used in the implementation of an iterator. Also, the limit isn't fixed, so I don't know at compile time whether I'm in the small-n or big-n case, and even the different small-n cases need different unrolling. – BeeOnRope Jul 28 '16 at 01:37
  • Nevertheless, in the key scenario, what percent of time will be spent doing this conditional incrementing? I only ask because sometimes programmers narrow in on one thing, which is like looking under the street light. – Mike Dunlavey Jul 28 '16 at 12:06

3 Answers

5

Here are two branch-free approaches that do not use any expensive calculations. The first one uses a comparison and a logical AND:

const bool eq = i == j;
i += eq;
j = (j + 1) & (eq - 1);

The second one uses a comparison and a multiplication:

const bool eq = i == j;
i += eq;
j = (j + 1) * (1 - eq);

In theory the "multiplication" variant should be slower than the "logical" one, but measurements show very little difference.

Both approaches result in branchless code only on processors that allow branchless comparisons (for example x86). These approaches also assume a language in which the result of a conditional expression is easily converted to an integer (for example C/C++, where a "false" comparison converts to the integer 0 and a "true" one to 1).

The only problem with these approaches is performance. They could in theory outperform branchy code, but only when mispredicts are really frequent. A simple test with no other work besides generating the "triangle sequence" (see it on ideone) shows a miserable mispredict rate, and as a result both branchless methods are about 3 times slower than the branchy one. The explanation is simple: there are not many mispredicts for longer sequences, and as for shorter ones, modern processors have very good branch predictors that almost never fail on short branch patterns. So with few mispredicts, the branchy code almost always executes only 2 instructions (compare, increment), while the branchless code executes both the active and inactive "branches" plus some instructions specific to the branchless approach.
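For reference, the two updates can be packaged as small helpers for testing (a sketch; `eq` and `j` are unsigned here so that `eq - 1` is a well-defined all-ones mask when `eq` is zero):

```c
#include <assert.h>

/* Branch-free stepping of (i, j) in count-up order: (0,0), (1,0), (1,1), (2,0), ... */

/* Variant 1: AND with (eq - 1), which is all-ones when eq == 0 and zero when eq == 1. */
static void step_mask(unsigned *i, unsigned *j) {
    const unsigned eq = (*i == *j);
    *i += eq;
    *j = (*j + 1) & (eq - 1);
}

/* Variant 2: multiply by (1 - eq), which zeroes j when i == j. */
static void step_mul(unsigned *i, unsigned *j) {
    const unsigned eq = (*i == *j);
    *i += eq;
    *j = (*j + 1) * (1 - eq);
}
```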


In case you want to repeatedly iterate over the small terms of the sequence, a different approach would probably be preferable: calculate the sequence only once, then repeatedly read it from memory.
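A minimal sketch of that precompute-and-replay idea (the table names and the N_MAX row limit are mine, not from the answer): the tables are filled once with ordinary branchy code, and the hot loop then only does sequential reads.

```c
#include <assert.h>
#include <stddef.h>

enum { N_MAX = 64 };  /* rows to precompute; N_MAX*(N_MAX+1)/2 pairs in total */
static unsigned char tri_i[N_MAX * (N_MAX + 1) / 2];
static unsigned char tri_j[N_MAX * (N_MAX + 1) / 2];

/* Fill the tables once with the branchy generator; returns the pair count. */
static size_t build_triangle_table(void) {
    size_t k = 0;
    for (unsigned i = 0; i < N_MAX; ++i)
        for (unsigned j = 0; j <= i; ++j) {
            tri_i[k] = (unsigned char)i;
            tri_j[k] = (unsigned char)j;
            ++k;
        }
    return k;
}
```

After build_triangle_table(), iterating the small terms is just reading tri_i[k], tri_j[k] for k = 0, 1, 2, ..., with only a single well-predicted loop branch.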

Evgeny Kluev
  • Excellent answer, I will digest it in more detail soon. You are right about mispredicts - but I had neglected the part about the shorter sequences being effectively memorized by the predictors. The newer Intel CPUs (somewhere around Haswell) got a much better predictor that can memorize much longer sequences of branches than prior generations. – BeeOnRope Jul 28 '16 at 00:46
  • My target languages are actually Java and C++. The former doesn't have the implicit conversion of boolean to `0/1` that your code uses, but if you write out the conversion with the ternary operator the compiler is often smart enough to make a branch free version. – BeeOnRope Jul 28 '16 at 00:47
1

In Python we can express this as:

i, j = i + (i == j), (j + 1) * (i != j)

but it turns out that, at around a million iterations or so on my machine, the following, more long-winded, lazily evaluated code is about 20% faster:

from itertools import count, repeat

def gen_i():
    """ A003056 """
    for x in count(0):  # infinitely counts up
        yield from repeat(x, x + 1)  # replication

def gen_j():
    """ A002262 """
    for x in count(0):  # infinitely counts up
        yield from range(x + 1)  # count up to (including) x

sequence = zip(gen_i(), gen_j())

for _ in range(1000000):
    i, j = next(sequence)

In the above code, gen_i(), gen_j(), count(), repeat(), and zip() are all lazy (generators or iterators), and range() produces a lazy iterable, so sequence continues to call into the code on demand as new (i, j) pairs are required. I assume the implementations of both range() and repeat() terminate with a misprediction.

Simple isn't necessarily also quick (e.g. consider all the unnecessary additions of zero and multiplications by one in the compact form.)

So which is more important, quickly generating the sequence or avoiding mispredictions?

cdlane
0

You can derive j from i:

...set val...
old_j = j;
j = (j + 1) % (i + 1);
if (i == old_j) {
    i++;
}
...loop if more...

And further derive i increment from j and current i:

...set val...
old_j = j;
j = (j + 1) % (i + 1);
i = i + (old_j / i);   /* old_j / i == 1 only when old_j == i; requires i >= 1 */
...loop if more...

(Can't test it at the moment... Please review)
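The divide-based update can be checked with a small sketch. It uses the quotient old_j / i, which is 1 exactly when old_j == i; since that requires i >= 1, (0, 0) is emitted by hand and the iteration state starts at (1, 0):

```c
#include <assert.h>

/* Branch-free update via % and /.  old_j / i == 1 exactly when old_j == i,
   so i advances at the end of each row; requires i >= 1. */
static void step_divmod(unsigned *i, unsigned *j) {
    const unsigned old_j = *j;
    *j = (*j + 1) % (*i + 1);
    *i = *i + (old_j / *i);
}
```

From (1, 0) this produces (1, 1), (2, 0), (2, 1), (2, 2), (3, 0), ..., matching the count-up order.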

Andreas
  • The first idea is essentially the same as the snippet I posted above, but with the conditional increment of `j` replaced by the mod operation. However, the conditional remains for the `i++` case, so there are as many branches as before and as many mispredicts. Basically, removing the `else` of a poorly predicted `if-else` doesn't really help, since the number of branches and the branch behavior don't really change. – BeeOnRope Jul 12 '16 at 18:17
  • The second idea is indeed fully branch-free, but unfortunately the `/` and `%` operators are in general about as slow as a mispredict, so this solution is likely to be slower, except on very unusual hardware. I will modify the question to make it clearer that performance is the primary concern here. – BeeOnRope Jul 12 '16 at 18:19
  • 1
    @BeeOnRope It was intended as a single idea explained in two steps. I was not sure about your platform. Some GPU language standards lack backwards branching altogether because of increased hardware complexity. Ok, it is performance you want... – Andreas Jul 12 '16 at 18:46
  • Ah, sorry, it makes sense now. Yeah the `/` and `%` approaches can work, especially when n happens to be a power of 2 and so `>>` and `&` can be substituted. In my case though I'm targeting CPUs (don't GPUs have poor div and mod capabilities as well?) and performance is the goal. – BeeOnRope Jul 12 '16 at 18:50
  • @BeeOnRope Performance of integer divide varies by platform. For PPC e500mc it is only 4 cycles when result is 0 (which would be the common case for this problem), and increased cycles as numbers get larger. Divide also use a different (more complex) execution unit, allowing other instructions (such as integer add and branch conditionally) to execute concurrently. Again, platform dependent. What is your "general platform"? I'd be happy to dig into documentation for you... – Andreas Jul 12 '16 at 18:58
  • That's a good point about the zero-case. I'm interested especially in x86-64, but other "common" platforms such as recent ARM variants are interesting too. At least on Intel even the best-case divide is, AFAIK, 10+ cycles on recent chips. – BeeOnRope Jul 12 '16 at 19:00