When joining four 1-byte vars into one 4-byte word, which is a faster way to shift and OR ? (comparing generated assembly code)

Question

So I'm currently studying bit-wise operators and bit-manipulation, and I have come across two different ways to combine four 1-byte words into one 4-byte wide word.

the two ways are given below

After finding out this two methods I compare the disassembly code generated by the two (compiled using gcc 11 with -O2 flag), I don't have the basic knowledge with disassembly and with the code it generates, and what I only know is the shorter the code, the faster the function is (most of the time I guess... maybe there are some exceptions), now for the two methods it seems that they have the same numbers/counts of lines in the generated disassembly code, so I guess they have the same performance?

I also got curious about the order of the instructions, the first method seems to alternate other instructions sal>or>sal>or>sal>or, while the second one is more uniform sal>sal>sal>or>or>mov>or does this have some significant effect in the performance say for example if we are dealing with a larger word?

Two methods

int method1(unsigned char byte4, unsigned char byte3, unsigned char byte2, unsigned char byte1)
{
    int combine = 0;
    combine = byte4;
    combine <<=8;
    combine |= byte3;
    combine <<=8;
    combine |= byte2;
    combine <<=8;
    combine |= byte1;
    return combine;
}

int method2(unsigned char byte4, unsigned char byte3, unsigned char byte2, unsigned char byte1)
{
    int combine = 0, temp;
    temp = byte4;
    temp <<= 24;
    combine |= temp;
    temp = byte3;
    temp <<= 16;
    combine |= temp;
    temp = byte2;
    temp <<= 8;
    combine |= temp;
    temp = byte1;
    combine |= temp;
    return combine;
}

Disassembly

// method1(unsigned char, unsigned char, unsigned char, unsigned char):
        movzx   edi, dil
        movzx   esi, sil
        movzx   edx, dl
        movzx   eax, cl
        sal     edi, 8
        or      esi, edi
        sal     esi, 8
        or      edx, esi
        sal     edx, 8
        or      eax, edx
        ret
// method2(unsigned char, unsigned char, unsigned char, unsigned char):
        movzx   edx, dl
        movzx   ecx, cl
        movzx   esi, sil
        sal     edi, 24
        sal     edx, 8
        sal     esi, 16
        or      edx, ecx
        or      edx, esi
        mov     eax, edx
        or      eax, edi
        ret

This might be "premature optimization", but I just want to know if there is a difference.

Relative speed will depend on many things, including the machine instruction set and how the compiler does optimisation in both cases. You might also want to consider the question of correctness, since (1) your code is doing bitwise operations on a signed integral type and (2) an `int` (whether `signed` or `unsigned`) is not actually guaranteed to be represented using four bytes (even if that is common on modern machines). — Peter, Nov 14 '21 at 05:12
@Peter "(2) an int (whether signed or unsigned) is not actually guaranteed to be represented using four bytes"... how about using std::bitset? — 0xdeadbeef, Nov 14 '21 at 05:29
You can use whatever you like. The point of my comment is that it is rarely a good idea to optimise performance if your optimised code may be used in circumstances that it produces incorrect results. Your code makes assumptions that are not guaranteed for all C++ implementations, so you need to decide if that is sufficient (e.g. if you will only ever use the code in situations where it will run correctly) - and preferably document those assumption and/or force the code to not compile if those assumptions are false. — Peter, Nov 14 '21 at 05:40
@Peter You don't have to try to show off. You could have just commented he should use a `long int` or an `unsigned long int`. — user904963, Nov 14 '21 at 06:52
`temp <<= 24;` This will cause undefined behaviour when `byte4 > 127` — eerorika, Nov 14 '21 at 07:16
kabibe sadagat, See [@eerorika](https://stackoverflow.com/questions/69960323/when-combining-four-1-byte-into-one-4-byte-word-which-is-much-faster#comment123670093_69960323). Why care about speed when the functionality is not certain? Note: neither approach is _much faster_ than the other. — chux - Reinstate Monica, Nov 14 '21 at 11:09
@chux - Reinstate Monica as I have said I'm currently studying these concepts and I got curious and I want to know — 0xdeadbeef, Nov 14 '21 at 11:25
@eerokia why would it cause undefined behaviour when byte4 is > 127? Isin't it an ```unsigned char``` is exactly 8-bits so the maximum value should be 255? and ```int```s have 32-bit so if I shift 24 to the left the byte would still fit? — 0xdeadbeef, Nov 14 '21 at 11:28
kabibesadagat, _curious_ is good. Wanting to know is good. Thinking one is _much faster_ is amiss. Relying on _undefined behavior_ for correct code functionality is _bad_ - shifting `<<` into the sign bit of `int` is UB. Using `uint8_t, uint32_t` instead of `unsigned char, int` makes more sense. — chux - Reinstate Monica, Nov 14 '21 at 11:33
Oh I see, is it because of the extra bit for the signedness? — 0xdeadbeef, Nov 14 '21 at 11:44
Not exactly, it's because of annoying C rules where signed-overflow UB applies even when left-shifting, not just with arithmetic operations like `+` and `*`. It's exactly like `<<= 1` being a `*=2`, not necessarily shifting the bit-pattern. (I guess that might make sense for one's complement.) On 2's complement machines this is just an annoyance. On sign/magnitude or one's complement, it might be a meaningful difference. — Peter Cordes, Nov 14 '21 at 19:35

Peter Cordes · Accepted Answer · 2021-11-16T23:34:18.877

now for the two methods it seems that they have the same numbers/counts of lines in the generated disassembly code, so I guess they have the same performance?

That hasn't been true for decades. Modern CPUs can execute instructions in parallel if they're independent of each other. See

How many CPU cycles are needed for each assembly instruction?
What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?

In your case, method 2 is clearly better (with GCC11 -O2 specifically) because the 3 shifts can happen in parallel, leaving only a chain of or instructions. (Most modern x86 CPUs only have 2 shift units, but the 3rd shift can be happening in parallel with the first OR).

Your first version has one long dependency chain of shift/or/shift/or (after the movzx zero-extension), so it has the same throughput but worse latency. If it's not on the critical path of some larger computation, performance would be similar.

The first version also has a redundant movzx edi, dil, because GCC11 -O2 doesn't realize that the high bits will eventually get shifted out the top of the register by the 3x8 = 24 bits of shifting. Unfortunately GCC chose to movzx into the same register (RDI into RDI), not for example movzx eax, dil which would let mov-elimination work.

The second one has a wasted mov eax, edx because GCC didn't realize it should do one of the movzx operations into EAX, instead of zero-extending each reg into itself (defeating mov-elimination). It also could have used lea eax, [edx + edi] to merge into a different reg, because it could have proved that those values couldn't have any overlapping bit-positions, so | and + would be equivalent.

That wasted mov generally only happens in small functions; apparently GCC's register allocator has a hard time when values need to be in specific hard registers. If it had its choice of where to produce the value, it would just end up with it in EDX.

So on Intel CPUs, yes by coincidence of different missed optimizations, both versions have 3 non-eliminated movzx and one instruction which can benefit from mov-elimination. On AMD CPUs, movzx is never eliminated, so the movzx eax, cl doesn't benefit.

Unfortunately, your compiler chooses to do all three or operations in sequence, instead of a tree of dependencies. (a|b) | (c|d) would be lower worst-case latency than (((a|b) | c) | d), critical path length of 2 from all inputs vs. 3 from d, 2 from c, 1 from a or b. (Writing it in C those different ways doesn't actually change how compilers make asm, because they knows that OR is associative. I'm using that familiar syntax to represent to dependency pattern of the assembly.)

So if all four inputs were ready at the same time, combining pairs would be lower latency, although it's impossible for most CPUs to produce three shift results in the same cycle.

I was able to hand-hold earlier GCC (GCC5) into making this dependency pattern (Godbolt compiler explorer). I used unsigned to avoid ISO C undefined behaviour. (GNU C does define the behaviour of left shifts even when a set bit is shifted in or out of the sign bit.)

int method4(unsigned char byte4, unsigned char byte3, unsigned char byte2, unsigned char byte1)
{
    unsigned combine = (unsigned)byte4 << 24;
    combine |= byte3 << 16;
    unsigned combine_low = (byte2 << 8) | byte1;
    combine |= combine_low;
    return combine;
}

# GCC5.5 -O3
method4(unsigned char, unsigned char, unsigned char, unsigned char):
        movzx   eax, sil
        sal     edi, 24
        movzx   ecx, cl
        sal     eax, 16
        or      edi, eax    # (byte4<<24)|(byte3<<16)
        movzx   eax, dl
        sal     eax, 8
        or      eax, ecx    # (byte2<<8) | (byte1)
        or      eax, edi    # combine halves
        ret

But GCC11.2 makes the same asm for this vs. method2. That would be good if it was optimal, but it's not.

the first method seems to alternate other instructions sal>or>sal>or>sal>or, while the second one is more uniform sal>sal>sal>or>or>mov>or

The dependency chains are the key factor. Mainstream x86 has been out-of-order exec for over 2 decades, and there haven't been any in-order exec x86 CPUs sold for years. So instruction scheduling (ordering of independent instructions) generally isn't a big deal over very small distances like a few instructions. Of course, in the alternating shl/or version, they're not independent so you couldn't reorder them without breaking it or rewriting it.

Can we do better with partial-register shenanigans?

This part is only relevant if you're a compiler / JIT developer trying to get a compiler to do a better job for source like this. I'm not sure there's a clear win here, although maybe yes if we can't inline this function so the movzx instructions are actually needed.

We can certainly save instructions, but even modern Intel CPUs still have partial-register merging penalties for high-8-bit registers like AH. And it seems the AH-merging uop can only issue in a cycle by itself, so it effectively costs at least 4 instructions of front-end bandwidth.

    movzx eax, dl
    mov   ah, cl
    shl   eax, 16        # partial register merging when reading EAX
    mov   ah, sil        # oops, not encodeable, needs a REX for SIL which means it can't use AH
    mov   al, dil

Or maybe this, which avoids partial-register stalls and false dependencies on Intel Haswell and later. (And also on uarches that don't rename partial regs at all, like all AMD, and Intel Silvermont-family including the E-cores on Alder Lake.)

# good on Haswell / AMD
# partial-register stalls on Intel P6 family (Nehalem and earlier)
merge4(high=EDI, mid_hi=ESI, mid_lo=EDX, low=ECX):
   mov   eax, edi       # mov-elimination possible on IvB and later, also AMD Zen
                        # zero-extension not needed because more shifts are coming
   shl   eax, 8
   shl   edx, 8
   mov   al, sil        # AX = incoming DIL:SIL
   mov   dl, cl         # DX = incoming DL:CL
   shl   eax, 16        
   mov   ax, dx         # EAX = incoming DIL:SIL:DL:CL
   ret

This is using 8-bit and 16-bit mov as an ALU merge operation pretty much like movzx + or, i.e. a bitfield insert into the low 8 or low 16. I avoided ever moving into AH or other high-8 registers, so there's no partial-register merging on Haswell or later.

This is only 7 total instructions (not counting ret), all of them single-uop. And one of them is just a mov which can often be optimized away when inlining, because the compiler will have its choice of which registers to have the value in. (Unless the original value of the high byte alone is still needed in a register). It will often know it already has a value zero-extended in a register after inlining, but this version doesn't depend on that.

Of course, if you were eventually storing this value to memory, doing 4 byte stores would likely be good, especially if it's not about to be reloaded. (Store-forwarding stall.)

Why doesn't GCC use partial registers? partial regs on other CPUs, and why my "good" version would be bad on crusty old Nehalem and earlier CPUs.
How exactly do partial registers on Haswell/Skylake perform? Writing AL seems to have a false dependency on RAX, and AH is inconsistent the basis for the partial-register versions
Why does GCC chose dword movl to copy a long shift count to CL? - re: using 32-bit mov to copy an incoming uint8_t. (Although that's saying why it's fine even on CPUs that do rename DIL separate from the full RDI, because callers will have written the full register.)

Semi-related:

Doing this without shifts, just partial-register moves, as an asm exercise: How do I shift left in assembly with ADD, not using SHL? sometimes using store/reload which will cause store-forwarding stalls.

@PeterCordes I think your optimized version at the end swap `mid_hi` and `high`? AFAICT it should be `movzx %dil, %eax; shl $8, %eax; shl $8 %edx; mov %sil, %al; /* rest the same */`. Although couldn't you do slightly better with `shl $8, %edi; shl $8, edx; mov %sil, %dil; mov %cl, %dl; shl $16, %edi; mov %edi, %eax; mov %dx, %ax`. (Difference is pure `mov` instead of `movzx`). — Noah, Nov 16 '21 at 19:20
@Noah: Yes, thanks, fixed. I remember finding it weird the whole time I was writing the answer that the args started with the high byte. Although it's like `_mm_set_epi8` so I guess it's not that weird. Anyway, I chose to still *start* with `mov` instead of `shl`, so I'm overwriting the `mov` result with the shift result, not with possible partial-reg shenanigans on IvB, in case it still renames AL separately from RAX while also doing mov-elimination. — Peter Cordes, Nov 16 '21 at 23:36
@PeterCordes `return (byte4 * 256u + byte3) * 65536u + (byte2 * 256u + byte1);` maps to the desired operation tree with [x86-64 clang](https://godbolt.org/z/oae7qs4E3). — njuffa, Nov 23 '21 at 07:50

score 4 · Answer 2 · answered Nov 14 '21 at 05:18

I completely agree with Émerick Poulin:

So the real answer, would be to benchmark it on your target machine to see which one is faster.

Nevertheless, I created a "method3()", and disassembled all three with gcc version 10.3.0, with -O0 and -O2, Here's a summary of the -O2 results:

Method3:

int method3(unsigned char byte4, unsigned char byte3, unsigned char byte2, unsigned char byte1)
{
    int combine = (byte4 << 24)|(byte3<<16)|(byte2<<8)|byte1;
    return combine;
}

gcc -O2 -S:

;method1:
    sall    $8, %eax
    orl %edx, %eax
    sall    $8, %eax
    movl    %eax, %edx
    movzbl  %r8b, %eax
    orl %edx, %eax
    sall    $8, %eax
    orl %r9d, %eax
    ...
;method2:
    sall    $8, %r8d
    sall    $16, %edx
    orl %r9d, %r8d
    sall    $24, %eax
    orl %edx, %r8d
    orl %r8d, %eax
    ...
;method3:
    sall    $8, %r8d
    sall    $16, %edx
    orl %r9d, %r8d
    sall    $24, %eax
    orl %edx, %r8d
    orl %r8d, %eax

method1 has fewer instructions than method2, and method2 compiles to exactly the same code has method3. method1 also has a couple of "moves", which are more "expensive" than "or" or "shift".

moves between registers are not more expensive than "or" or "shift" — harold, Nov 14 '21 at 07:10
@harold Actually, they are likely to be a lot cheaper. Cheaper as in, they are likely to get eliminated in the frontend. — fuz, Nov 14 '21 at 10:42

score 1 · Answer 3 · answered Nov 14 '21 at 05:02

1

Without optimizations, method 2 seems to be a tiny bit faster.

This is an online benchmark of the code provided: https://quick-bench.com/q/eyiiXkxYVyoogefHZZMH_IoeJss

However, it is difficult to get accurate metrics from tiny operations like this. It also depends on the speed of the instructions on a given CPU (different instructions can take more or less clock cycles).

Furthermore, bitwise operators tend to be pretty fast since they are "basic" instructions.

So the real answer, would be to benchmark it on your target machine to see which one is faster.

answered Nov 14 '21 at 05:02

0xd34dc0de

493
4
10

4

There's little point in benchmarking unoptimised code. The speed of unoptimised code does not correlate with the optimised version. – rici Nov 14 '21 at 05:24
I see thank you for this – 0xdeadbeef Nov 14 '21 at 05:25

James Lecka · Answer 4 · 2021-11-21T19:02:57.080

0

a 'c' only portable.

unsigned int portable1(unsigned char byte4, unsigned char byte3,
                       unsigned char byte2, unsigned char byte1)
{
union tag1 
  {
  _uint32 result;
  _uint8 a[4];
  } u;
 u.a[0] = byte1;
 u.a[1] = byte2;
 u.a[2] = byte3;
 u.a[3] = byte4;
 return u.result;
}

This should generate 4 MOV and a register load in most environments. If the union is defined at module scope or global, the final LOAD (or MOV) disappears from the function. We can also easily allow for a 9bit byte and Intel versus Network byte ordering...

edited Nov 21 '21 at 19:02

answered Nov 21 '21 at 18:43

James Lecka

1
1

Note that this implementation *does* depend on the endianness of the C implementation, unlike the portable C in the question. Also, you're using non-standard type names like `_uint32` instead of C99 `uint32_t`. But yes, ISO C99 does define the behaviour of using a union this way, unlike ISO C++. (The GNU dialect of C++ does define the behaviour of union type-punning, as does MSVC++). Anyway, with optimization enabled, compilers compile this similar to shift/or code, using SHL/OR. (Except GCC11.2 uses DL/DH partial register stuff clunkily on x86-64: https://godbolt.org/z/Y7Kee6zba) – Peter Cordes Nov 21 '21 at 20:14
Note that `uint8_t`, if it exists at all, is guaranteed to be an 8-bit type with no padding. (Same for `uint32_t` being 32-bit no padding). Those types are optional. If you wanted to account for the possibility of `CHAR_BIT != 8` and concatenate four `char`s, you'd use an array of `char` and need some ifdef to figure out what integer type is the right size. e.g. some DSPs have CHAR_BIT = 16 or 24, same as `int` so `sizeof(int) == 1`. I'm not sure a union makes this easier at all. – Peter Cordes Nov 21 '21 at 20:18

When joining four 1-byte vars into one 4-byte word, which is a faster way to shift and OR ? (comparing generated assembly code)

4 Answers4

Can we do better with partial-register shenanigans?