Compilers are really smart about simple arithmetic and bitwise operations like these. They don't emit a single instruction for this simply because they can't: no such instruction exists on those architectures. It isn't worth spending valuable opcode space on rarely used operations like that. Most operations work on the whole register anyway, and operating on just part of a register is inefficient for the CPU, because the out-of-order execution and register renaming units have to work a lot harder. That's why x86-64 instructions that write 32-bit registers zero the upper half of the full 64-bit register, and why modifying the low part of a register on x86 (like AL or AX) can be slower than modifying the whole RAX. INC can also be slower than ADD 1 on some microarchitectures because of the partial flag update.
That said, there are architectures that can do a combined SHIFT and XOR in a single instruction, like ARM (eor rX, rX, rY, lsl #24), because ARM's designers spent a big part of the instruction encoding on predication and shifting, at the cost of a smaller number of registers. But again, your premise is wrong: the fact that something can be executed in a single instruction doesn't mean it'll be faster. Modern CPUs are very complex, and every instruction has its own latency, throughput and set of execution ports. For example, if one CPU can execute 4 SHIFT-then-XOR pairs in parallel, it'll obviously be faster than another CPU that runs 4 single SHIFT-XOR instructions sequentially, provided the clock speed is the same.
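To make the scalar operation concrete, here is a minimal C sketch of it. The function name is made up for illustration, and the instructions a compiler actually emits depend on the compiler, flags and target: on x86-64 you would typically see a separate shift and XOR, while an ARM compiler can fold the shift into the eor shown above.

```c
#include <stdint.h>

/* XOR the low byte of y into the top byte of x, i.e. x ^ (y << 24).
   Hypothetical helper name; it does not come from the question. */
static inline uint32_t xor_top_byte(uint32_t x, uint32_t y)
{
    /* Unsigned shift: bits shifted past bit 31 are simply discarded. */
    return x ^ (y << 24);
}
```

Either way the source stays the same; whether it becomes one instruction or two is the compiler's decision, not yours.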
This is a very typical XY problem, because the approach you have in mind is simply the wrong way to do it. Operations that need to be done thousands or millions of times or more are a job for the GPU or the SIMD unit.
For example, this is what the Clang compiler emits for a loop XORing the top byte of i with c on an x86 CPU with AVX-512:
vpslld zmm0, zmm0, 24 # shift
vpslld zmm1, zmm1, 24
vpslld zmm2, zmm2, 24
vpslld zmm3, zmm3, 24
vpxord zmm0, zmm0, zmmword ptr [rdi + 4*rdx] # xor
vpxord zmm1, zmm1, zmmword ptr [rdi + 4*rdx + 64]
vpxord zmm2, zmm2, zmmword ptr [rdi + 4*rdx + 128]
vpxord zmm3, zmm3, zmmword ptr [rdi + 4*rdx + 192]
By doing that it achieves 16 SHIFT-and-XOR operations with just 2 instructions. Imagine how fast that is. You can unroll further, up to 32 zmm registers, to achieve even higher performance until you saturate the RAM bandwidth. That's why all high-performance architectures have some kind of SIMD, which makes fast parallel operations easy, rather than a useless SHIFT-XOR instruction. Even on ARM, which has a single-instruction SHIFT-XOR, the compiler is smart enough to know that SIMD is faster than a series of eor rX, rX, rY, lsl #24. The output looks like this:
shl v3.4s, v3.4s, 24 # shift
shl v2.4s, v2.4s, 24
shl v1.4s, v1.4s, 24
shl v0.4s, v0.4s, 24
eor v3.16b, v3.16b, v7.16b # xor
eor v2.16b, v2.16b, v6.16b
eor v1.16b, v1.16b, v4.16b
eor v0.16b, v0.16b, v5.16b
Here's a demo for the above snippets
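The source loop itself isn't reproduced here, so the following is only a sketch of a loop with that shape. It assumes i and c are arrays of 32-bit integers, and the function name, the restrict qualifiers and the length parameter n are my own additions. With optimization enabled and a SIMD-capable target, compilers will typically vectorize a loop like this into shift and XOR instructions similar to the ones above.

```c
#include <stddef.h>
#include <stdint.h>

/* XOR the low byte of each c[k] into the top byte of the matching i[k].
   The name and signature are assumptions made for illustration.
   restrict tells the compiler the arrays don't overlap, which makes
   auto-vectorization easier. */
void xor_top_bytes(uint32_t *restrict i, const uint32_t *restrict c, size_t n)
{
    for (size_t k = 0; k < n; k++)
        i[k] ^= c[k] << 24;   /* one shift and one XOR per element */
}
```

Note that nothing in the source asks for SIMD. The plain loop is enough, and the compiler picks the vector width and unroll factor for the target.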
That'll be even faster when run in parallel on multiple cores. The GPU is also capable of a very high level of parallelism, which is why modern cryptography and intensive mathematical problems are often run on the GPU. It can break a password or encrypt a file faster than a general-purpose CPU with SIMD.