Optimizing XOR of `char` with the highest byte of `int`

Question

Let we have int i and char c.

When use i ^= c the compiler will XOR c with the lowest byte of i, and will translate the code to the single processor instruction.

When we need to XOR c with highest byte of i we can do something like this:

i ^= c << ((sizeof(i) - sizeof(c)) * 8)

but the compiler will generate two instructions: XOR and BIT-SHIFT.

Is there a way to XOR a char with the highest byte of an int that will be translated to single processor instruction in C++?

On x86, if `i` is in memory, you can certainly emit a single architectural instruction instruction to XOR an operand with its highest byte. However, if the operand has to be in memory to begin with, and the resulting (one) instruction uses three micro-ops, it is questionable whether that is a situation that is relevant to "optimize" to begin with. Certainly, a shift instruction and a reg-reg xor are faster. What are your constraints? — Dolda2000, Mar 10 '19 at 19:21
If `i` is in memory and `c` is in a register, then certainly `xor %al, 3(%rbx)` would do the trick (assuming `AL` contains `c` and `RBX` contains the address of `i`), and @Artyer's answer would probably compile into that instruction. However, as I said, is that really a situation worth optimizing for? Again, if you can have `i` in a register, then a shift instruction plus a reg-reg xor would be a lot faster, despite being two architectural instructions. — Dolda2000, Mar 10 '19 at 19:36
Don't worry to much about location of "i". I have declared "register int i", and when i compile for "release" it seems all instruction are reg-reg. "i ^= c" is a little faster than xor+shift. — NoSkill, Mar 10 '19 at 20:08
The location of `i` certainly matters very much. For instance, Artyer's answer below probably does produce the one instruction that you claim to be looking for, but it would also force `i` to memory (the `register` constraint on variables means very little to modern compilers), which would certainly degrade your actual performance. — Dolda2000, Mar 10 '19 at 20:10
`single processor instruction` is too broad. There are tons of architectures out there, each with instructions in varying complexity. For example ARM has the ability of shifting in any instructions, so that can translate to a single [`eor r2, r2, r1, lsl #24`](https://godbolt.org/z/GKGmre) instruction. And why do you care about shift and xor which are dirt cheap? This is just premature optimization — phuclv, Mar 11 '19 at 15:50
@phuclv there are critical part of the crypto-code which process a lot of data. And algo needs to xor data with lowest and highest byte of int many times. As I know "addressing" highest byte of register supported on all processors, but disappointed that C/C++ has no ability this type optimizations. So I have to rely compilers, that are not all smart enough. — NoSkill, Mar 11 '19 at 21:24
they're really smart. They don't do that simply because there's no instruction in the architecture that they can use for this purpose. Any CPUs can *address* any byte in memory, but most of them can't operate on specific bytes in register except the LSB, therefore no chance for the compiler to optimize. If you deal with a lot of data a lot of times then it's the job of the GPU, SIMD unit or multi threading — phuclv, Mar 12 '19 at 00:22

score 1 · Answer 1 · answered Mar 10 '19 at 19:08

1

If you are confident about the byte-order of the system, for example by checking __BYTE_ORDER__ or equivalent macro on your system, you can do something like this:

#if // Somehow determing if little endian, so biggest byte at the end
    *(&reinterpret_cast<char&>(i) + sizeof i - 1) ^= c
#else
    // Is big endian, biggest byte at the beginning
    reinterpret_cast<char&>(i) ^= c
#endif

answered Mar 10 '19 at 19:08

Artyer

31,034
3
47
75

Thanks for answer. But is there platform independent solution for addressing highest byte of object in C++? – NoSkill Mar 10 '19 at 20:15
1

but this requires `i` to be in memory and not register – phuclv Mar 11 '19 at 15:41
@phuclv Oh, yes! In this case it's no reg-reg operation. – NoSkill Mar 11 '19 at 21:29

phuclv · Answer 2 · 2020-10-01T16:02:03.510

Compilers are really smart about such simple arithmetic and bitwise operations. They don't do that simply because they can't, as there are no such instructions on those architectures. It doesn't worth wasting valuable opcode space for rarely used operations like that. Most operations are done in the whole register anyway, and working on just a part of a register is very inefficient for the CPU, because the out-of-order execution or register renaming units will need to work a lot harder. That's the reason why x86-64 instructions on 32-bit registers zero the upper part of the full 64-bit register, or why modifying the low part of the register in x86 (like AL or AX) can be slower than modifying the whole RAX. INC can also be slower than ADD 1 because of the partial flag update

That said, there are architectures that can do combined SHIFT and XOR in a single instruction like ARM (eor rX, rX, rY, lsl #24), because ARM designers spent a big part of the instruction encoding for the predication and shifting part, trading off for a smaller number of registers. But again your premise is wrong, because the fact that something can be executed in a single instruction doesn't mean that it'll be faster. Modern CPUs are very complex because every instruction has different latency, throughput and the number of execution ports. For example if a CPU can execute 4 pairs of SHIFT-then-XOR in parallel then obviously it'll be faster than another CPU that can run 4 single SHIFT-XOR instructions sequentially, provided that clock cycle is the same

This is a very typical XY problem, because what you thought is simply the wrong way to do. For operations that need to be done thousands, millions of times or more then it's the job of the GPU or the SIMD unit

For example this is what the Clang compiler emits for a loop XORing the top byte of i with c on an x86 CPU with AVX-512

vpslld  zmm0, zmm0, 24                             # shift
vpslld  zmm1, zmm1, 24
vpslld  zmm2, zmm2, 24
vpslld  zmm3, zmm3, 24
vpxord  zmm0, zmm0, zmmword ptr [rdi + 4*rdx]      # xor
vpxord  zmm1, zmm1, zmmword ptr [rdi + 4*rdx + 64]
vpxord  zmm2, zmm2, zmmword ptr [rdi + 4*rdx + 128]
vpxord  zmm3, zmm3, zmmword ptr [rdi + 4*rdx + 192]

By doing that it achieves 16 SHIFT-and-XOR operations with just 2 instruction. Imagine how fast that is. You can unroll more to 32 zmm registers to achieve even higher performance until you saturate the RAM bandwidth. That's why all high-performance architectures have some kind of SIMD which is easier to do fast parallel operations, rather than a useless SHIFT-XOR instruction. Even on ARM with a single-instruction SHIFT-XOR then the compiler will be smart enough to know that SIMD is faster than a series of eor rX, rX, rY, lsl #24. The output is like this

shl     v3.4s, v3.4s, 24       # shift
shl     v2.4s, v2.4s, 24
shl     v1.4s, v1.4s, 24
shl     v0.4s, v0.4s, 24
eor     v3.16b, v3.16b, v7.16b # xor
eor     v2.16b, v2.16b, v6.16b
eor     v1.16b, v1.16b, v4.16b
eor     v0.16b, v0.16b, v5.16b

Here's a demo for the above snippets

That'll be even faster when running in parallel in multiple cores. The GPU is also able to do very high level or parallelism, hence modern cryptography and intense mathematical problems are often done on the GPU. It can break a password or encrypt a file faster than a general purpose CPU with SIMD

score 0 · Answer 3 · edited Oct 01 '20 at 22:47

Don't assume that the compiler will generate a shift with the above code. Most modern compilers are smarter than that:

int i = 0;   // global in memory
char c = 0;

void foo ()
{
    c ^= (i >> 24);
}

Compiles (https://godbolt.org/z/b6l8qk) with GCC (trunk) -O3 to this asm for x86-64:

foo():
        movsx   eax, BYTE PTR i[rip+3]         # sign-extending byte load
        xor     BYTE PTR c[rip], al            # memory-destination byte xor
        ret

clang, ICC, and MSVC all emit equivalent code (some using a movzx load which is more efficient on some CPUs than movsx).

This only really helps if the int is in memory, not a register, which is hopefully not the case in a tight loop unless it's different integers (like in an array).

Checked there some other compilers and it seems not all are so smart. Anyway thanks for the link. — NoSkill, Mar 11 '19 at 08:55

Optimizing XOR of `char` with the highest byte of `int`

3 Answers3