2

I've spent too many brain cycles on this over the last day.

I'm trying to come up with a set of bitwise operations that may re-implement the following condition:

uint8_t a, b;
uint8_t c, d;
uint8_t e, f;
...

bool result = (a == 0xff || a == b) && (c == 0xff || c == d) && (e == 0xff || e == f);

Code I'm looking at has four of these expressions, short-circuit &&ed together (as above).

I know this is an esoteric question, but the short-circuit nature of this and the timing of the above code in a tight loop makes the lack of predictable time a royal pain, and quite frankly, it seems to really suck on architectures where branch prediction isn't available, or so well implemented.

Is there such a beast that would be concise?

bvarner
  • 363
  • 3
  • 9
  • 1
    Unable to understand your question, could you please rephrase? – Rasmi Ranjan Nayak Aug 10 '20 at 17:49
  • 1
    What makes you think the compiler is not able to do its job correctly? This expression is crystal clear and I'd expect the compiler to generate the best assembly for it. Do you have data that shows otherwise? – Jeffrey Aug 10 '20 at 17:51
  • 5
    An ancient trick to avoid branching is to use `|` and `&` instead. – molbdnilo Aug 10 '20 at 17:51
  • 1
    FYI, `a ^ b` will return 0 if both bits are the same. – Thomas Matthews Aug 10 '20 at 17:52
  • 1
    This reads like it's from some contest/challenge/competitive coding/hacking site. Is it? If your goal is to learn C++, you won't learn anything there. In nearly all cases, like this one, the correct solution is based on a mathematical or a programming trick. And, of course, that contest/challenge/competitive coding/hacking site doesn't tell you what the trick is. If you're trying to learn C++, you won't learn anything from meaningless online contest sites [but only from a good C++ textbook](https://stackoverflow.com/questions/388242/the-definitive-c-book-guide-and-list). – Sam Varshavchik Aug 10 '20 at 17:59
  • 1
    To be clear, is there a reason you want it to be constant-time? – user253751 Aug 10 '20 at 18:03
  • So you don't need the short-circuit - but prefer a more constant execution time. Is that it? – Support Ukraine Aug 10 '20 at 18:32
  • So this is not from some competition / contest / challenge site. In profiling an application, this code came up as eating up an excessive amount of time, negatively impacting performance. (flame graph was pretty telling) So I'm trying to find some way to speed up the comparison, that won't dramatically alter the code. It's expensive function with a disproportionate number of samples, so it seems like an outlier begging to be optimized. It's probably more of an issue of needing a map where there's a vector. In my mind, doing this is the first step to determining if we can hash for that. – bvarner Aug 10 '20 at 18:40
  • 1
    So for 8-bit, the 1s compliment of `0xff` is `0` as does `a-b`. Probably could `|` the results and check 0? `!(~a | (a-b)) | ...)` for each pair. Don't know if it is faster, but it might be more constant depending on the architecture. Honestly though, if this arithmetic code is your bottleneck, you must be really really calling this code a ton. Any way to cache results and call less? – Michael Dorgan Aug 10 '20 at 18:41
  • 1
    I'd also ask which exact asm instructions are the flamegraph hotspots from perf. If it's the first branch or two, then the early out is working. If it is later, consider reorganizing the compares so that the most likely to fail is first in the list of compares and therefore the early out happens as soon as possible. And use, all compiler help like O2 or O3, LTO, PGO , etc. if possible... – Michael Dorgan Aug 10 '20 at 18:46
  • Toss it all into an array then iterate across that one in a loop. Then optimize further from there if needed. – Lundin Aug 11 '20 at 07:02

3 Answers3

1

So, if you really want to do bit-twiddling to make this "fast" (which you really should only do after profiling your code to make sure this is a bottleneck), what you want to do is vectorize this by packing all the values together into a wider word so you can do all the comparisons at once (one instruction), and then extract the answer from a few bits.

There are a few tricks to this. To compare two value for equality, you can xor (^) them and test to see if the result is zero. To test a field of a wider word to see if it is zero, you can 'pack' it with a 1 bit above, then subtract one and see if the extra bit you added is still 1 -- if it is now 0, the value of the field was zero.

Putting all this together, you want to do 6 8-bit compares at once. You can pack these values into 9 bit fields in a 64-bit word (9 bits to get that extra 1 guard bit your going to test for subtraction). You can fit up to 7 such 9 bit fields in a 64 bit int, so no problem

// pack 6 9-bit values into a word
#define VEC6x9(A,B,C,D,E,F)  (((uint64_t)(A) << 45) | ((uint64_t)(B) << 36) | ((uint64_t)(C) << 27) | ((uint64_t)(D) << 18) | ((uint64_t)(E) << 9) | (uint64_t)(F))

// the two values to compare
uint64_t v1 = VEC6x9(a, a, c, c, e, e);
uint64_t v2 = VEC6x9(b, 0xff, d, 0xff, f, 0xff);
uint64_t guard_bits = VEC6x9(0x100, 0x100, 0x100, 0x100, 0x100, 0x100);
uint64_t ones = VEC6x9(1, 1, 1, 1, 1, 1);
uint64_t alt_guard_bits = VEC6x9(0, 0x100, 0, 0x100, 0, 0x100);

// do the comparisons in parallel
uint64_t res_vec = ((v1 ^ v2) | guard_bits) - ones;

// mask off the bits we'll ignore (optional for clarity, not needed for correctness)
res_vec &= ~guard_bits;

// do the 3 OR ops in parallel
res_vec &= res_vec >> 9;

// get the result
bool result = (res_vec & alt_guard_bits) == 0;

The ORs and ANDs at the end are 'backwards' becuase the result bit for each comparison is 0 if the comparison was true (values were equal) and 1 if it was false (values were not equal.)

All of the above is mostly of interest if you are writing a compiler -- its how you end up implementing a vector comparison -- and it may well be the case that a vectorizing compiler will do it all for you automatically.

This can be much more efficient if you can arrange to have your initial values pre-packed into vectors. This may in turn influence your choice of data structures and allowable values -- if you arrange for your values to be 7-bit or 15-bit (instead of 8-bit) they may pack nicer when you add the guard bits...

Chris Dodd
  • 119,907
  • 13
  • 134
  • 226
  • I'm going to try this out. I had already come up with packing as an option but got hung up on trying to do the comparisons between the two acceptable values as separate instructions, rather than packing both options into the same value for comparison. Let me play around with this and see if it does the trick. – bvarner Aug 12 '20 at 01:37
  • Could you elaborate on the part about doing the 3 OR ops in parallel? I'm not following how a right-shift by 9 bits is doing.. oh, it's an AND against the 9 right most bits. Got it. – bvarner Aug 12 '20 at 02:17
  • @bvarner: All that we care about are the 6 (single bit) comparison results, which are in every 9th (bits 8, 17, 26, 35, 44, and 53). We want (8||17), (26||35) and (44||53), so the shift and and gives us those 3 results in parallel. It also touches a bunch of other bits, but we don't care about those -- just these 3 result, which the last check looks at, – Chris Dodd Aug 12 '20 at 22:20
0

You could modify how you store and interpret the data:

When a if 0xFF, do you need the value of b. If not, then make b equal to 0xFF and simplify the expression by removing the part that test for 0xFF.

Also, you might combine a, b and c in a single variable.

uint32_t abc;
uint32_t def;

bool result = abc == def;

Other operations might be slower but that loop should be much faster (single comparison instead of up to 6 comparisons).

You might want to use an union to be able to access byte individually or in group. In that case, make sure that the forth byte is always 0.

Phil1970
  • 2,605
  • 2
  • 14
  • 15
0

To remove timing variations with &&, ||, use &, |. @molbdnilo. Possible faster, maybe not. Certainly easier to parallel.

// bool result = (a == 0xff || a == b) && (c == 0xff || c == d) 
//     && (e == 0xff || e == f);
bool result = ((a == 0xff) | (a == b)) & ((c == 0xff) | (c == d))
    & ((e == 0xff) | (e == f));
chux - Reinstate Monica
  • 143,097
  • 13
  • 135
  • 256