Encoding 3 base-6 digits in 8 bits for unpacking performance

Question

I'm looking for an efficient-to-unpack (in terms of small number of basic ALU ops in the generated code) way of encoding 3 base-6 digits (i.e. 3 numbers in the range [0,5]) in 8 bits. Only one is needed at a time, so approaches that need to decode all three in order to access one are probably not good unless the cost of decoding all three is very low.

The obvious method is of course:

x = b%6;   //  8 insns
y = b/6%6; // 13 insns
z = b/36;  //  5 insns

The instruction counts are measured on x86_64 with gcc>=4.8 which knows how to avoid divs.

Another method (using a different encoding) is:

b *= 6
x = b>>8;
b &= 255;

b *= 6
y = b>>8;
b &= 255;

b *= 6
z = b>>8;

This encoding has more than one representation for many tuples (it uses the whole 8bit range rather than just [0,215]) and appears more efficient if you want all 3 outputs, but wasteful if you only want one.

Are there better approaches?

Target language is C but I've tagged this assembly as well since answering requires some consideration of the instructions that would be generated.

We'd have to know which architecture. Even so, different compilers may generate different code. For x86, `b%6` is less than 8 instructions, it's just 4 even counting all the loads and stores. Not to mention that number of instructions doesn't directly represent execution speed or code size. — Jester, Apr 06 '18 at 21:14
@Jester: It should be reasonably good across different architectures. Doing `b%6` as a div instruction (the only way I know to get it in <8 insns) is ridiculously slow and I don't generally consider div as part of "basic ALU ops". — R.. GitHub STOP HELPING ICE, Apr 06 '18 at 21:17
In terms of instructions a lookup table is probably shortest. And it uses zero ALU ops :) — Jester, Apr 06 '18 at 21:21
Yeah, a `3 * 256`-byte LUT would work for decoding. (Or `4 * 256`-byte to make the table indexing one insn cheaper / 1 cycle shorter latency if the LUT isn't static with a 32-bit absolute address for a `[table + rax + rax*2]` addressing mode. Not easy to parallelize with x86 SSSE3 `pshufb` (16 parallel byte lookups from a table of 16 bytes in an XMM register), because it's not separable: you can't just slice it into two 4-bit LUT lookups like you can for popcnt. — Peter Cordes, Apr 06 '18 at 21:30
LUT is also nice in that you can parameterize it with the index (0/1/2) of the digit you want rather than branching for different ALU code paths. — R.. GitHub STOP HELPING ICE, Apr 06 '18 at 21:30
Probably just a `uint8_t [3][256]` table indexed by which digit you want and the packed input byte. Or `[3][216]` to trade space/L1 for indexing cost (probably not worth it) or `[216][3]` to lessen that (*3 is cheapish). — R.. GitHub STOP HELPING ICE, Apr 06 '18 at 21:33
As you user C, counting instructions does not make sense, as that depends on way too many parameters, including the code. Additionally the number of instructions expecially on complex architectures like x86/64 is absolutely no reasonable measure for the run-time. In general don't do premature optimisations. **If** you have a performance problem, profile and optimise the hotspots. If you want specific instructions, write assembler code. — too honest for this site, Apr 06 '18 at 21:46
@Olaf: Unless your target is only a specific machine, you can't optimize to measurements. You need some sort of proxy, and what I've called "basic ALU ops" (assuming they all exist and take roughly 1 cycle on any reasonable architecture) is a fairly good proxy. It's totally possible to reason in terms of what underlying ops the compiler would have to generate without having a specific target in mind. — R.. GitHub STOP HELPING ICE, Apr 07 '18 at 00:14
I'd go with `uint8_t LUT[216][3]`, or especially a 3-byte struct, if you ever need multiple components from the same source byte. On x86 in position-dependent code with a static LUT, the `*3` is basically free, costing 1 byte of code-size and no extra latency given an index already zero-extended in a register. Otherwise it costs 1 cycle of latency for an LEA. (Or 2 cycles on AMD, where a scaled-index costs extra latency even with only base+index and no displacement.) ARM / AArch64 also has very cheap `*3`, with `add r1, r0, r0 lsl #1` — Peter Cordes, Apr 07 '18 at 01:29
Counting instructions (and not code size or ILP) is a mistake here. This would be a better question if you phrased your request for efficiency in terms of typical modern x86 and ARM CPUs, for example (i.e. with fairly good to excellent HW multiply units so they can implement div and mod by compile-time constants efficiently). I voted to reopen, but using a bogus metric for your efficiency criterion is not a good idea. — Peter Cordes, Apr 07 '18 at 01:33
Hmm, if you *only* ever need 1 at a time, then `[3][256]` is probably very reasonable. Especially if you often need "the middle number" from multiple sources in a row. That could give you more spatial locality between lookups if one of the three tables is rarely touched. `256-216 = 40`, so it's not a full cache line of unused entries, just *most* of a 64-byte cache line, unfortunately. (Standard for modern x86 and ARM.) — Peter Cordes, Apr 07 '18 at 02:47
Is packing 5x 3-bit integers into a `uint16_t` an option? Then you just shift/mask to extract the right one. Less dense than 3 per byte, though. — Peter Cordes, Apr 07 '18 at 02:48
Are you optimizing for latency or throughput here? (Typically total uop count is a good measure of impact on surrounding code; even two multiplies for mod / div are unlikely to bottleneck on that ALU port when mixed with surrounding code). Do you care about in-order CPUs (e.g. low-end ARM)? **Is the selector a runtime variable**, or do different usages of the unpacking code know at compile time which integer they want to extract from a group of 3? Will it tend to predict well? You mentioned branching on it. Probably not worth doing a branchless 3-way ALU strategy. — Peter Cordes, Apr 07 '18 at 08:14
@R..: I'm not sure what you call a "reasonable architecture". On one of the **existing** architectures ARM for example certain shifts don't take any time at all, because they are combined with other operations. On all larger architectures, caches, buffers, pipelines and e.g. speculative execution exist which can add for 1000 extra clock cycles or combine instructions, execute two in parallel, etc. But no need to go even that deep: compiler optimisations alone are complex enough to make your assumptions pointless already. These often depend on the bigger context, up to LTO. Thats why we profile — too honest for this site, Apr 07 '18 at 09:42

score 2 · Answer 1 · answered Apr 07 '18 at 09:30

As discussed in comments, a LUT would be excellent if it stays hot in cache. uint8_t LUT[3][256] would need the selector scaled by 256, which takes an extra instruction if it's not a compile-time constant. Scaling by 216 to pack the LUT better is only 1 or 2 instructions more expensive. struct3 LUT[216] is nice, where the struct has a 3-byte array member. On x86, this compiles extremely well in position-dependent code where the LUT base can be a 32-bit absolute as part of the addressing mode (if the table is static):

struct { uint8_t vals[3]; } LUT[216];
unsigned decode_LUT(uint8_t b, unsigned selector) {
    return LUT[b].vals[selector];
}

gcc7 -O3 on Godbolt for x86-64 and AArch64

    movzx   edi, dil
    mov     esi, esi                 # zero-extension to 64-bit: goes away when inlining.
    lea     rax, LUT[rdi+rdi*2]      # multiply by 3 and add the base
    movzx   eax, BYTE PTR [rax+rsi]  # then index by selector
    ret

Silly gcc used a 3-component LEA (3 cycle latency and runs on fewer ports) instead of using LUT as a disp32 for the actual load (no extra latency for an indexed addressing mode, I think).

This layout has the added advantage of locality if you ever need to decode multiple components of the same byte.

In PIC / PIE code, this costs 2 extra instructions, unfortunately:

    movzx   edi, dil
    lea     rax, LUT[rip]           # RIP-relative LEA instead of absolute as part of another addressing mode
    mov     esi, esi
    lea     rdx, [rdi+rdi*2]
    add     rax, rdx
    movzx   eax, BYTE PTR [rax+rsi]
    ret

But that's still cheap, and all the ALU instructions are single-cycle latency.

Your 2nd ALU unpacking strategy is promising. I thought at first we could use a single 64-bit multiply to get b*6, b*6*6, and b*6*6*6 in different positions of the same 64-bit integer. (b * ((6ULL*6*6<<32) + (36<<16) + 6)

But the upper byte of each multiply result does depend on masking back to 8-bit after each multiply by 6. (If you can think of a way to not require that, one multiple and shift would be very cheap, especially on 64-bit ISAs where the entire 64-bit multiply result is in one register).

Still, x86 and ARM can multiply by 6 and mask in 3 cycles of latency, the same or better latency than a multiply, or less on Intel CPUs with zero-latency movzx r32, r8, if the compiler avoids using parts of the same register for movzx.

add    eax, eax              ; *2
lea    eax, [rax + rax*2]    ; *3
movzx  ecx, al               ; 0 cycle latency on Intel
.. repeat for next steps

ARM / AArch64 is similarly good, with add r0, r0, r0 lsl #1 for multiply by 3.

As a branchless way to select one of the three, you could consider storing (from ah / ch / ... to get the shift for free) to an array, then loading with the selector as the index. This costs store/reload latency (~5 cycles), but is cheap for throughput and avoids branch misses. (Possibly a 16-bit store and then a byte reload would be good, scaling the selector in the load address and adding 1 to get the high byte, saving an extract instruction before each store on ARM).

This is in fact what gcc emits if you write it this way:

unsigned decode_ALU(uint8_t b, unsigned selector) {
    uint8_t decoded[3];
    uint32_t tmp = b * 6;
    decoded[0] = tmp >> 8;
    tmp = 6 * (uint8_t)tmp;
    decoded[1] = tmp >> 8;
    tmp = 6 * (uint8_t)tmp;
    decoded[2] = tmp >> 8;

    return decoded[selector];
}

    movzx   edi, dil
    mov     esi, esi
    lea     eax, [rdi+rdi*2]
    add     eax, eax
    mov     BYTE PTR -3[rsp], ah      # store high half of mul-by-6
    movzx   eax, al                   # costs 1 cycle: gcc doesn't know about zero-latency movzx?
    lea     eax, [rax+rax*2]
    add     eax, eax
    mov     BYTE PTR -2[rsp], ah
    movzx   eax, al
    lea     eax, [rax+rax*2]
    shr     eax, 7
    mov     BYTE PTR -1[rsp], al
    movzx   eax, BYTE PTR -3[rsp+rsi]
    ret

The first store's data is ready 4 cycles after the input to the first movzx, or 5 if you include the extra 1c of latency for reading ah when it's not renamed separately on Intel HSW/SKL. The next 2 stores are 3 cycles apart.

So the total latency is ~10 cycles from b input to result output, if selector=0. Otherwise 13 or 16 cycles.

Why not `(b*57)>>11` for the highest digit, and a single 3+3-bit packed LUT for the lower digits, `LUT[b] >> 3` and `LUT[b] & 7`, respectively? — Nominal Animal, Apr 07 '18 at 11:33
This looks like it's probably the right answer for the question as stated, even if not for my particular needs, so I'll probably accept it. Waiting a bit first to see if any additional interesting answers show up. — R.. GitHub STOP HELPING ICE, Apr 07 '18 at 23:56

score 1 · Answer 2 · answered Apr 07 '18 at 23:55

Measuring a number of different approaches in-place in the function that needs to do this, the practical answer is really boring: it doesn't matter. They're all running at about 50ns per call, and other work is dominating. So for my purposes, the approach that pollutes the cache and branch predictors the least is probably the best. That seems to be:

(b * (int[]){2048,342,57}[i] >> 11) % 6;

where b is the byte containing the packed values and i is the index of the value wanted. The magic constants 342 and 57 are just the multiplicative constants GCC generates for division by 6 and 36, respectively, scaled to a common shift of 11. The final %6 is spurious in the /36 case (i==2) but branching to avoid it does not seem worthwhile.

On the other hand, if doing this same work in a context where there wasn't an interface constraint to have the surrounding function call overhead per lookup, I think an approach like Peter's would be preferable.

Encoding 3 base-6 digits in 8 bits for unpacking performance

2 Answers2