
This is an optimization problem. I want to copy a bitfield of six 5-bit elements to a u8 buffer, naively done like this:

void Expand(u32 x, u8 b[6]) {
    b[0] = (x >> 0) & 31;
    b[1] = (x >> 5) & 31;
    b[2] = (x >> 10) & 31;
    b[3] = (x >> 15) & 31;
    b[4] = (x >> 20) & 31;
    b[5] = (x >> 25) & 31;
}

This is the assembly generated by x86 MSVC v19 (latest) with flags /O2 /Ot /Gr; GCC and Clang give approximately the same thing.

@Expand@8 PROC
        mov     al, cl
        and     al, 31
        mov     BYTE PTR [edx], al
        mov     eax, ecx
        shr     eax, 5
        and     al, 31
        mov     BYTE PTR [edx+1], al
        mov     eax, ecx
        shr     eax, 10
        and     al, 31
        mov     BYTE PTR [edx+2], al
        mov     eax, ecx
        shr     eax, 15
        and     al, 31
        mov     BYTE PTR [edx+3], al
        mov     eax, ecx
        shr     eax, 20
        shr     ecx, 25
        and     al, 31
        and     cl, 31
        mov     BYTE PTR [edx+4], al
        mov     BYTE PTR [edx+5], cl
        ret     0
@Expand@8 ENDP

But I just don't like it; I know it does exactly what it should be doing, it just seems to me that it could be a lot more efficient.
To me it looks like a 30-bit number that needs to be scaled up to a 48-bit number while inserting zeroes.

                  11111 11111 11111 11111 11111 11111
                                                    ↓
00011111 00011111 00011111 00011111 00011111 00011111

I've been trying SHIFTing, ORing, only ANDing at the end with a u64 (0x1f1f1f1f1f1f), but I remain unsuccessful in my optimization efforts. I'm confident this should be doable in less than 10 instructions, any guidance would be appreciated.

EDIT

I scratched my head a little bit more and this is the best I could come up with so far:

void Expand(u32 x, u8 b[6]) {
    memset(b, 31, 6);
    b[0] &= x;
    b[1] &= x >>= 5;
    b[2] &= x >>= 5;
    b[3] &= x >>= 5;
    b[4] &= x >>= 5;
    b[5] &= x >>= 5;
}

Compiles to:

@Expand@8 PROC
        mov     eax, 0x1f1f1f1f
        mov     DWORD PTR [edx], eax
        mov     WORD PTR [edx+4], ax
        and     BYTE PTR [edx], cl
        shr     ecx, 5
        and     BYTE PTR [edx+1], cl
        shr     ecx, 5
        and     BYTE PTR [edx+2], cl
        shr     ecx, 5
        and     BYTE PTR [edx+3], cl
        shr     ecx, 5
        and     BYTE PTR [edx+4], cl
        shr     ecx, 5
        and     BYTE PTR [edx+5], cl
        ret     0
@Expand@8 ENDP
MAP
  • Possibly related? https://stackoverflow.com/questions/24225786/fastest-way-to-unpack-32-bits-to-a-32-byte-simd-vector – alagner Sep 18 '22 at 08:15
  • Have you measured that this is an actual bottleneck in your program? Unless it's one of the top-two I wouldn't bother. – Some programmer dude Sep 18 '22 at 08:15
  • What do you want to optimise for? Please select from the list in the info for the tag you used. https://stackoverflow.com/tags/optimization/info And asking for "most optimised" still ends up being opinionated. – Yunnosch Sep 18 '22 at 08:23
  • I think the problem is in your intuition. I don't see how you would expect this to be fewer operations. – Cheatah Sep 18 '22 at 08:25
  • AVX512VBMI `vpmultishiftqb` can do this in one instruction (well, two or three including loading a mask constant). See [SIMD unpack 12-bit fields to 16-bit](https://stackoverflow.com/q/66347928) for a related problem using SIMD shuffles and ANDs, although 16-12 = 4 being a factor of 8 makes that much easier. If you have *efficient* BMI2 `pdep` (Zen3 or Haswell), you can also do it in one instruction, but it's a disaster on earlier AMD, slower than regular unpacking. (Well, maybe 2 separate `pdep` if you're stuck in 32-bit mode so you can't get all 6 bytes into one register) – Peter Cordes Sep 18 '22 at 08:26
  • @Cheatah: Yeah, for it to be many fewer operations, you need primitive operation building blocks like `pdep` that aren't available in the baseline instruction set. e.g. an AVX2 variable-count shift (`vpsrlvd`) where you can do 8 separate shift counts on individual separate 32-bit elements. So widen to 16 (avx-512) or even 32-bit elements where each element, with a byte shuffle to grab the right 2 bytes into every element (after a broadcast + `vpshufb`), then pack back down and mask to keep the low bits in each byte. – Peter Cordes Sep 18 '22 at 08:29
  • @Yunnosch I want to write C code that will yield the most efficient x86 assembly proc with the compilation environment I described. – MAP Sep 18 '22 at 08:29
  • Don't optimize unless actually needed. Hand-optimizing code tends to make it harder to read, understand and maintain. You need plenty of documentation to describe such code. And unless there's an actual performance requirement, and the code is profiled to be a major bottleneck, hand-optimization tends to be more work than it's worth. – Some programmer dude Sep 18 '22 at 08:32
  • Are you willing to use x86 intrinsics with an `#ifdef`, or does the same version of the function have to be portable to other ISAs like AArch64 that have different SIMD? MSVC lets you use intrinsics without enabling the ISAs on the command line, so your MSVC setup doesn't rule out using AVX2. – Peter Cordes Sep 18 '22 at 08:33
  • @PeterCordes Ideally I'd like this to just be "C code that compiles well on every arch", so no weird instructions or hidden intrinsics, even if my primary focus is still the x86 base ISA. – MAP Sep 18 '22 at 08:42
  • Ok, then you probably can't go much faster. IDK if there's anything useful in terms of masking odd/even fields and shifting two at once, but probably not since they need to be at different offsets. Compilers probably need help to figure out how to use `pdep` or SIMD tricks, and won't do it for you. So if you don't want to tell it how, you're not going to get the "most efficient way possible". – Peter Cordes Sep 18 '22 at 08:52
  • The second and third can be done with a 16-bit register instead of 32-bit if that shaves anything. And then, if `ecx` is shifted 15 bits, the next three can also be done with one 8-bit and two 16-bit operations. – Weather Vane Sep 18 '22 at 08:59
  • Apart from that, if "lines of code" is the issue, you could have a loop. – Weather Vane Sep 18 '22 at 09:07
  • "most efficient assembly code" fine. Most efficient for what? Please select from the list. – Yunnosch Sep 18 '22 at 09:24
  • Divide and conquer... If you can use 64 bits (I can't) a shift and a mask would turn your 30 bits into 2 clusters of 15 bits that could be "parallel" processed. You can extract quintets 0 & 3 (perhaps use a union?), shift right and get 1 & 4, then shift right one last time to get 2 & 5... Just a thought... – Fe2O3 Sep 18 '22 at 09:49
  • @MargaretBloom could you explain a bit more of what's happening with the magic number for the mask and what you mean by "image" in your code comment? Is the table then just supposed to be an array holding all 6-byte values? – MAP Sep 18 '22 at 09:58
  • Maybe you can try a struct with bit fields mapped onto the 32-bit integer via a pointer. – Ptit Xav Sep 18 '22 at 10:03
  • @MAP I misunderstood your question. I thought you were only interested in a single bit. Sorry, ignore my previous comment. – Margaret Bloom Sep 18 '22 at 12:34
  • I think you can do a little better by building up the result in a `uint64_t x;` and then `memcpy(b, &x, 6)`, assuming a little-endian machine. At least then you only get a 32-bit and a 16-bit store at the end, instead of all these byte accesses. – Nate Eldredge Sep 18 '22 at 14:50
  • This is not my skill set, but will note that `gcc -O4` spends a lot of time on this and chooses parallel instructions. I don't know if they're a win. https://gcc.godbolt.org/z/noG9n7Yqj – Gene Sep 19 '22 at 01:48
  • ... I suspect not at least for some processors, since the code you posted (iiuc) looks like it would allow as many parallel dispatches as integer data paths. – Gene Sep 19 '22 at 01:54
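To make the `pdep` suggestion from the comments concrete: `pdep` deposits successive low bits of its source at the positions of the set bits of its mask, which is exactly the 30-to-48-bit "insert zeroes" picture above. Below is a sketch using a portable software emulation so it runs anywhere; the names `pdep_emul` and `ExpandPdep` are hypothetical, and real BMI2 hardware does the deposit in the single `_pdep_u64` instruction from `<immintrin.h>` (compile with `-mbmi2`):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Portable emulation of pdep, for illustration only: source bit i is
   deposited at the position of the i-th set bit of the mask. */
uint64_t pdep_emul(uint64_t src, uint64_t mask) {
    uint64_t result = 0;
    for (uint64_t bit = 1; mask != 0; bit <<= 1) {
        uint64_t low = mask & (~mask + 1);  /* lowest remaining set bit */
        if (src & bit)
            result |= low;
        mask ^= low;
    }
    return result;
}

/* Hypothetical name: expand six 5-bit fields into six bytes. */
void ExpandPdep(uint32_t x, uint8_t b[6]) {
    uint64_t r = pdep_emul(x, 0x1F1F1F1F1F1FULL);
    memcpy(b, &r, 6);  /* little-endian assumed: lowest field lands in b[0] */
}
```

With hardware `pdep` the whole body collapses to one instruction plus a 6-byte store, but as the comments note it is microcoded and slow on AMD before Zen 3.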

3 Answers


Here's a cross-platform solution that only needs a fast multiplier, which is available on almost all desktop architectures:

void Expand(uint32_t x, uint8_t b[6]) {
    // binary literals and digit separators are C++14; in plain C use the
    // equivalent hex masks 0x01F07C1F and 0x3E0F83E0
    uint32_t x024 = x & 0b00'00000'11111'00000'11111'00000'11111; // fields 0, 2, 4
    uint32_t x135 = x & 0b00'11111'00000'11111'00000'11111'00000; // fields 1, 3, 5
    uint64_t r024 = x024 * 0x0100'0000'4000'0010ULL & 0x1F001F001F000000;
    uint64_t r135 = x135 * 0x0040'0000'1000'0004ULL & 0xF800F800F8000000;
    uint64_t result = r024 | (r135 >> 11);
#if !BIG_ENDIAN
    result = htonll(result); // byte swap; htonll is non-standard, substitute your platform's bswap64
#endif
    memcpy(b, &result, 6);
}

See below for the detailed math. It needs ~8-9 operations and runs as 2 parallel dependency chains. You can improve on that by passing an 8-byte array instead of 6 and restoring the last 2 elements b[6]/b[7] afterwards if necessary, so a single 8-byte store can be used.

But you should really use #ifdef and provide efficient implementations for each supported platform, plus a generic fallback like the above for the rest. The fastest way on x86 is SIMD or PDEP, depending on whether you do this for a big array or only sporadically. All other platforms also have their own SIMD that can accelerate this. Alternatively, you can use a platform-agnostic SIMD library to emit efficient SIMD code for any architecture.


Note that instruction count is not a measure of performance; not all instructions are equal. Your "best" version is actually worse than the first one, because it forms a single long dependency chain, whereas in the first version the CPU can start 5 independent computations and run them in parallel.

Remember that some instructions are slow, so several simple instructions can beat one complex equivalent. Instructions that can execute in parallel are also faster than a shorter sequence with serial dependencies, and a short loop is worse than a straight-line run.


The math behind the algorithm

Let the input be the 32 bits 00aaaaabbbbbcccccdddddeeeeefffff. Each multiplication produces shifted copies of the fields, which land in the correct positions after masking:

                                  0000000bbbbb00000ddddd00000fffff (x024)
× 0000000100000000000000000000000001000000000000000000000000010000 (0x0100'0000'4000'0010)
  ────────────────────────────────────────────────────────────────
  00000000000000000000000000000000000bbbbb00000ddddd00000fffff0000 (x024 << 4)
  000000000bbbbb00000ddddd00000fffff000000000000000000000000000000 (x024 << 30)
+ 000fffff00000000000000000000000000000000000000000000000000000000 (x024 << 56, high bits truncated)
  ────────────────────────────────────────────────────────────────
  000fffff0bbbbb00000ddddd00000fffff0bbbbb00000ddddd00000fffff0000
& 0001111100000000000111110000000000011111000000000000000000000000 (0x1F001F001F000000)
  ────────────────────────────────────────────────────────────────
= 000fffff00000000000ddddd00000000000bbbbb000000000000000000000000 (r024)

                                  00aaaaa00000ccccc00000eeeee00000 (x135)
× 0000000001000000000000000000000000010000000000000000000000000100 (0x0040'0000'1000'0004)
  ────────────────────────────────────────────────────────────────
  00000000000000000000000000000000aaaaa00000ccccc00000eeeee0000000 (x135 << 2)
  000000aaaaa00000ccccc00000eeeee000000000000000000000000000000000 (x135 << 28)
+ eeeee00000000000000000000000000000000000000000000000000000000000 (x135 << 54, high bits truncated)
  ────────────────────────────────────────────────────────────────
  eeeee0aaaaa00000ccccc00000eeeee0aaaaa00000ccccc00000eeeee0000000
& 1111100000000000111110000000000011111000000000000000000000000000 (0xF800F800F8000000)
  ────────────────────────────────────────────────────────────────
= eeeee00000000000ccccc00000000000aaaaa000000000000000000000000000 (r135)

Shifting the second result right by 11 and merging it with the first gives 000fffff000eeeee000ddddd000ccccc000bbbbb000aaaaa0000000000000000, which contains the expected bytes in the correct order when stored as big endian in memory.

Output assembly for comparison

For more details about the algorithm see How to create a byte out of 8 bool values (and vice versa)?
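To sanity-check the arithmetic above, here is a plain-C spelling of the same idea tested against the naive version: hex masks instead of C++14 binary literals, and explicit byte extraction instead of the non-standard htonll(). Function names are mine:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Reference: the question's naive field-by-field extraction. */
void ExpandNaive(uint32_t x, uint8_t b[6]) {
    for (int i = 0; i < 6; i++)
        b[i] = (x >> (5 * i)) & 31;
}

/* Multiply-and-mask version: two independent multiply chains, merged. */
void ExpandMul(uint32_t x, uint8_t b[6]) {
    uint32_t x024 = x & 0x01F07C1F;  /* fields 0, 2, 4 */
    uint32_t x135 = x & 0x3E0F83E0;  /* fields 1, 3, 5 */
    uint64_t r024 = x024 * 0x0100000040000010ULL & 0x1F001F001F000000ULL;
    uint64_t r135 = x135 * 0x0040000010000004ULL & 0xF800F800F8000000ULL;
    uint64_t result = r024 | (r135 >> 11);  /* field i ends up in byte 7-i */
    for (int i = 0; i < 6; i++)
        b[i] = (uint8_t)(result >> (56 - 8 * i));
}
```

Comparing the two functions over the whole 2^30 input space is cheap and is a good way to validate any further tweaks to the constants.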

phuclv
  • Thank you for this insight, I admit I didn't pay attention to dependency chains and parallel execution since I'm not really familiar with those, but yeah I know instruction count is not the only measure of performance. – MAP Sep 19 '22 at 07:01
  • @MAP: A single dependency chain is bad for latency, but if it can overlap with surrounding code it can be ok for throughput. *If* the instructions are all cheap, like single uop, not memory-destination `and` which is 2 uops. (https://uops.info/ / https://agner.org/optimize/). One way to do the chain of shifts strategy might be to mask into odd/even elements, so the low byte would always have 1 field zero-extended to fill the byte, ready to store. Like `even = x&mask` ; `odd = (x>>5)& mask` then store the low bytes and repeat `even>>=5;` `odd >>= 5;`. That has 2 parallel dep chains. – Peter Cordes Sep 19 '22 at 21:30
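The two-dependency-chain idea from that comment can be sketched as follows (the name `ExpandTwoChains` is mine, and the repeated shift is 10 rather than 5, since the surviving fields within each chain are 10 bits apart):

```c
#include <assert.h>
#include <stdint.h>

/* Two parallel dependency chains: mask odd/even fields up front so the
   low byte of each chain is always one zero-extended field, ready to store. */
void ExpandTwoChains(uint32_t x, uint8_t b[6]) {
    uint32_t even = x & 0x01F07C1F;         /* fields 0, 2, 4 */
    uint32_t odd  = (x >> 5) & 0x01F07C1F;  /* fields 1, 3, 5 */
    b[0] = (uint8_t)even;  b[1] = (uint8_t)odd;
    even >>= 10;           odd  >>= 10;
    b[2] = (uint8_t)even;  b[3] = (uint8_t)odd;
    even >>= 10;           odd  >>= 10;
    b[4] = (uint8_t)even;  b[5] = (uint8_t)odd;
}
```

The instruction count is similar to the question's second version, but the critical path through the shifts is roughly halved because the two chains can execute in parallel.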

To me it looks like a 30-bit number that needs to be scaled up to a 48-bit number while inserting zeroes.

I can see why you say that, but inasmuch as you also say you want to be architecture agnostic, it's not quite right. There is an important but somewhat subtle distinction between your bitfield representation and your byte-array representation: bitfield identity / index within the packed number is a function of value whereas byte identity / index within the array is a function of storage order. Thus, it would be closer to say that you want to convert a 30-bit native-format number to a 48-bit little-endian number.

One way to accomplish that is simply by reading out the number into the destination array, which is exactly what the two alternatives presented in the question do. In that sense, you're already doing it. But if you imagine arithmetically expanding the number as a separate step then you have to acknowledge that you need afterwards to store it in the array. I suppose you have in mind to memcpy() it into place, but note well that for that arithmetic + memcpy() to make sense as an alternative to direct read-out, the arithmetic has to be arch-sensitive.

But let's explore the arithmetic a bit anyway. After all, maybe you don't care about non-little-endian architectures. Given your comment that ...

Ideally I'd like this to just be "C code that compiles well on every arch", so no weird instructions or hidden intrinsics

..., I'll consider only the operations provided by standard C. The objective is to compute a 64-bit integer containing the wanted expansion in its least-significant 48 bits.

A key limitation here is that each bitfield of the input has to be shifted a different distance. The most straightforward approach is to do it field by field, maybe something like this:

    uint64_t expanded =
        (((uint64_t) x)       &           0x1f) +
        (((uint64_t) x << 3)  &         0x1f00) +
        (((uint64_t) x << 6)  &       0x1f0000) +
        (((uint64_t) x << 9)  &     0x1f000000) +
        (((uint64_t) x << 12) &   0x1f00000000) +
        (((uint64_t) x << 15) & 0x1f0000000000);

Or perhaps a compiler would treat this variation more kindly:

    uint64_t temp = x;
    uint64_t expanded = temp &   0x1f;
    temp <<= 3;
    expanded |= temp &         0x1f00;
    temp <<= 3;
    expanded |= temp &       0x1f0000;
    temp <<= 3;
    expanded |= temp &     0x1f000000;
    temp <<= 3;
    expanded |= temp &   0x1f00000000;
    temp <<= 3;
    expanded |= temp & 0x1f0000000000;

But of course, both of those perform more arithmetic operations than your alternatives do, so there is little reason to expect simpler assembly code. Nevertheless, you might see an overall performance improvement arising from fewer accesses to memory to store the result (assuming use of memcpy(); not shown, assumed to be optimized to avoid an actual function call).
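That final store step might be sketched as follows (a hypothetical helper name; assumes a little-endian machine, as discussed above):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Store the low 48 bits of the expanded value. On little-endian the low
   byte of 'expanded' lands in b[0]. Compilers typically turn this memcpy
   into one 4-byte and one 2-byte store, not a function call. */
void store48(uint64_t expanded, uint8_t b[6]) {
    memcpy(b, &expanded, 6);
}
```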

Perhaps you were looking for a bit-twiddling hack to get more work out of fewer arithmetic operations. The scope for that is small, because you have only six fields to operate upon in the first place. The only way to factor that other than 6 (x 1) is 3 x 2, and with the latter you must expect at best 3 + 2 - 1 = 4 sets of operations instead of 6 sets. Something like this:

uint64_t temp = (((uint64_t) x << 9) & 0xffffff000000) | (x & 0x7fff);
uint64_t expanded = temp &     0x1f00001f;
temp <<= 3;
expanded |=         temp &   0x1f00001f00;
temp <<= 3;
expanded |=         temp & 0x1f00001f0000;

That yields a moderate gain in terms of arithmetic operation count relative to the straightforward versions: three shifts instead of five, five ands instead of six, and three ors instead of five. The compiler might or might not treat this code as kindly as it does a different one. And it's still more arithmetic operations than your direct read-out alternatives.
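For anyone who wants to verify the 3 x 2 factoring, here it is wrapped in a function (my name) and checked against direct extraction:

```c
#include <assert.h>
#include <stdint.h>

/* The 3 x 2 factored expansion: pre-position fields 3..5 into the upper
   half, then expand two fields per shift-and-mask step. */
uint64_t expand3x2(uint32_t x) {
    uint64_t temp = (((uint64_t) x << 9) & 0xffffff000000) | (x & 0x7fff);
    uint64_t expanded = temp & 0x1f00001f;       /* fields 0 and 3 */
    temp <<= 3;
    expanded |= temp & 0x1f00001f00;             /* fields 1 and 4 */
    temp <<= 3;
    expanded |= temp & 0x1f00001f0000;           /* fields 2 and 5 */
    return expanded;
}
```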

John Bollinger
  • 160,171
  • 8
  • 81
  • 157
  • Interesting read but yeah, it looks like compilers just didn't like any of them for x86, although clang for ARM seems to be doing alright (still more instructions than my second solution unfortunately). – MAP Sep 18 '22 at 18:04
  • Well yes, @MAP, that's indeed what I would expect, and what this answer anticipates. Refer to the last sentence. But consider also "Nevertheless, you might see an overall performance improvement arising from fewer accesses to memory to store the result". – John Bollinger Sep 18 '22 at 18:08
  • If your measure is number of assembly instructions, @MAP, then I don't think you do better than the best alternative presented in the question. – John Bollinger Sep 18 '22 at 18:11

Your version has 6 shifts (5 really) and 6 maskings (and 6 assignments, naturally).

I'd suggested "divide & conquer" in the comments.

This version has 4 shifts and 4 masks. It could probably be tidied up, and I've no idea what the assembly for it looks like. Would be interesting to see!

Anyway...

#include <stdint.h>
#include <stdio.h>
#include <string.h>

void expand( uint32_t x, uint8_t b[ 6 ] ) {
    union {
        uint32_t val32;
        struct {
            uint16_t lo;
            uint16_t hi;
        } b15;
    } v, valt;

    v.val32 = x << 1; // to shove b15 into b16
    v.b15.lo = (uint16_t)(x & 0xFFFF);

    valt.val32 = (v.val32 >> 0) & 0x001F001F;
    b[0] = (uint8_t)valt.b15.lo;
    b[3] = (uint8_t)valt.b15.hi;

    valt.val32 = (v.val32 >> 5) & 0x001F001F;
    b[1] = (uint8_t)valt.b15.lo;
    b[4] = (uint8_t)valt.b15.hi;

    valt.val32 = (v.val32 >>10) & 0x001F001F;
    b[2] = (uint8_t)valt.b15.lo;
    b[5] = (uint8_t)valt.b15.hi;
}

int main() {
    uint8_t b[ 6 ] = { 7, 24, 31, 0, 6, 1, }; // 0 - 31 only!!

    uint32_t x = (b[5]<<25) | (b[4]<<20) | (b[3]<<15) | (b[2]<<10) | (b[1]<<5) | (b[0]<<0);

    memset( b, 0, sizeof b );

    expand( x, b );

    for( int i = 0; i < 6; i++ )
        printf( "[%d] %u  ", i, b[i] );
    puts( "" );

    return 0;
}

Output

[0] 7  [1] 24  [2] 31  [3] 0  [4] 6  [5] 1

EDIT

Some people don't know how to express appreciation for offered help. To be clear, it is not my project. I wrote that I didn't bother looking at the assembly output. The criticism in the comment below seems excessive, imo.

So, having nothing else on, here's a "readable" revision that works on 32-bit architectures. The "memory accesses" are an unavoidable result of not having a 64-bit register to work with.

void expand( uint32_t x, uint8_t b[ 6 ] ) {
    union {
        uint32_t x;
        struct { uint8_t l, lx, h, hx; } q;
    } u;

    x = ((x << 1)&0xffff0000)|(x & 0x0000ffff);

    u.x = (x >> 0) & 0x001f001f;
    b[0] = u.q.l;
    b[3] = u.q.h;

    u.x = (x >> 5) & 0x001f001f;
    b[1] = u.q.l;
    b[4] = u.q.h;

    u.x = (x >>10) & 0x001f001f;
    b[2] = u.q.l;
    b[5] = u.q.h;
}
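A quick self-contained check of the revised routine against direct extraction (the routine is repeated here under the name `expand32` so the snippet compiles on its own; the union punning assumes little-endian byte order):

```c
#include <assert.h>
#include <stdint.h>

/* Same routine as above, renamed. Remaps the upper fields up by one bit
   so fields 0/3, 1/4, 2/5 sit 16 bits apart, then extracts two per step. */
void expand32(uint32_t x, uint8_t b[6]) {
    union {
        uint32_t x;
        struct { uint8_t l, lx, h, hx; } q;  /* little-endian layout */
    } u;

    x = ((x << 1) & 0xffff0000) | (x & 0x0000ffff);

    u.x = (x >> 0) & 0x001f001f;
    b[0] = u.q.l;  b[3] = u.q.h;

    u.x = (x >> 5) & 0x001f001f;
    b[1] = u.q.l;  b[4] = u.q.h;

    u.x = (x >> 10) & 0x001f001f;
    b[2] = u.q.l;  b[5] = u.q.h;
}
```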
Fe2O3
  • I can see what you tried to do but unfortunately this is equivalent in terms of generated instructions, worse in terms of memory accesses and worse regarding C code readability. – MAP Sep 18 '22 at 13:48