I had written ternary logic simulator not so long ago, and this question was viable to me, as it directly affects my interpretator execution speed; I was required to simulate tons and tons of ternary logic gates as fast as possible.
In a binary-coded-ternary system one trit is packed in two bits. Most significant bit means negative and least significant means positive one. Case "11" should not occur, but it must be handled properly and threated as 0.
Consider inline int bct_decoder( unsigned bctData )
function, which should return our formatted trit as regular integer -1, 0 or 1; As i observed there are 4 approaches: i called them "cond", "mod", "math" and "lut"; Lets investigate them
First is based on jz|jnz and jl|jb conditional jumps, thus cond. Its performance is not good at all, because relies on a branch predictor. And even worse - it varies, because it is unknown if there will be one branch or two a priori. And here is an example:
inline int bct_decoder_cond( unsigned bctData ) {
unsigned lsB = bctData & 1;
unsigned msB = bctData >> 1;
return
( lsB == msB ) ? 0 : // most possible -> make zero fastest branch
( lsB > msB ) ? 1 : -1;
}
This is slowest version, it could involve 2 branches in worst case and this is something where binary logic fails. On my 3770k it prodices around 200MIPS on average on random data. (here and after - each test is average from 1000 tries on randomly filled 2mb dataset)
Next one relies on modulo operator and its speed is somewhere in between first and third, but is definetely faster - 600 MIPS:
inline int bct_decoder_mod( unsigned bctData ) {
return ( int )( ( bctData + 1 ) % 3 ) - 1;
}
Next one is branchless approach, which involves only maths, thus math; it does not assume jump instrunctions at all:
inline int bct_decoder_math( unsigned bctData ) {
return ( int )( bctData & 1 ) - ( int )( bctData >> 1 );
}
This does what is should, and behaves really great. To compare, performance estimate is 1000 MIPS, and it is 5x faster than branched version. Probably branched version is slowed down due to lack of native 2-bit signed int support. But in my application it is quite good version in itself.
If this is not enough then we can go futher, having something special. Next is called lookup table approach:
inline int bct_decoder_lut( unsigned bctData ) {
static const int decoderLUT[] = { 0, 1, -1, 0 };
return decoderLUT[ bctData & 0x3 ];
}
In my case one trit occupied only 2 bits, so lut table was only 2b*4 = 8 bytes, and was worth trying. It fits in cache and works blazing fast at 1400-1600 MIPS, here is where my measurement accuracy is going down. And that is is 1.5x speedup from fast math approach. That's because you just have precalculated result and single AND
instruction. Sadly caches are small and (if your index length is greater than several bits) you simply cannot use it.
So i think i answered your question, on what what could branched/branchless code be like. Answer is much better and with detailed samples, real world application and real performance measurements results.