This is an optimization problem. I want to copy a bitfield of six 5-bit elements to a u8 buffer, naively done like this:
void Expand(u32 x, u8 b[6]) {
b[0] = (x >> 0) & 31;
b[1] = (x >> 5) & 31;
b[2] = (x >> 10) & 31;
b[3] = (x >> 15) & 31;
b[4] = (x >> 20) & 31;
b[5] = (x >> 25) & 31;
}
This is the assembly generated by x86 msvc v19.latest
, flags /O2 /Ot /Gr
, gcc and clang will give approximately the same thing.
@Expand@8 PROC
mov al, cl
and al, 31
mov BYTE PTR [edx], al
mov eax, ecx
shr eax, 5
and al, 31
mov BYTE PTR [edx+1], al
mov eax, ecx
shr eax, 10
and al, 31
mov BYTE PTR [edx+2], al
mov eax, ecx
shr eax, 15
and al, 31
mov BYTE PTR [edx+3], al
mov eax, ecx
shr eax, 20
shr ecx, 25
and al, 31
and cl, 31
mov BYTE PTR [edx+4], al
mov BYTE PTR [edx+5], cl
ret 0
@Expand@8 ENDP
But I just don't like it; I know it does exactly what it should be doing, it just seems to me that it could be a lot more efficient.
To me it looks like a 30-bit number that needs to be scaled up to a 48-bit number while inserting zeroes.
11111 11111 11111 11111 11111 11111
↓
00011111 00011111 00011111 00011111 00011111 00011111
I've been trying SHIFTing, ORing, only ANDing at the end with a u64 (0x1f1f1f1f1f1f
), but I remain unsuccessful in my optimization efforts. I'm confident this should be doable in less than 10 instructions, any guidance would be appreciated.
EDIT
I scratched my head a little bit more and this is the best I could come up with so far:
void Expand(u32 x, u8 b[6]) {
memset(b, 31, 6);
b[0] &= x;
b[1] &= x >>= 5;
b[2] &= x >>= 5;
b[3] &= x >>= 5;
b[4] &= x >>= 5;
b[5] &= x >>= 5;
}
Compiles to:
@Expand@8 PROC
mov eax, 0x1f1f1f1f
mov DWORD PTR [edx], eax
mov WORD PTR [edx+4], ax
and BYTE PTR [edx], cl
shr ecx, 5
and BYTE PTR [edx+1], cl
shr ecx, 5
and BYTE PTR [edx+2], cl
shr ecx, 5
and BYTE PTR [edx+3], cl
shr ecx, 5
and BYTE PTR [edx+4], cl
shr ecx, 5
and BYTE PTR [edx+5], cl
ret 0
@Expand@8 ENDP