2

Lets swap 2 variables.

int temp = a;
a = b;
b = temp;

Here's some half-optimized asm pseudocode:

mov eax, dword ptr [rbp+4]
mov ebx, dword ptr [rbp+8]
mov dword ptr [rbp+8], eax
mov dword ptr [rbp+4], ebx

Would it be faster to xor the objects with eachother?

a ^= b ^= a ^= b;

asm psuedocode:

mov eax, dword ptr[rbp+4]
xor eax, dword ptr[rbp+8]
xor dword ptr[rbp+8], eax
xor eax, dword ptr[rbp+8]
xor dword ptr[rbp+8], eax
mov eax, dword ptr[rbp+4]

Which of these would be faster? (Guestimates welcome)

nrz
  • 10,435
  • 4
  • 39
  • 71
J V
  • 11,402
  • 10
  • 52
  • 72
  • 1
    it would also depend on what CPU this was running on. – built1n Apr 03 '14 at 20:18
  • This question is moot, considering none of the instructions in question can operate directly on two memory operands. – Ben Voigt Apr 03 '14 at 20:21
  • 1
    Wait really? Well that rules out `xchg` in any case, I'll remake my pseudocode to `mov` to registers first – J V Apr 03 '14 at 20:23
  • I did not measure, but I would guess that memory accesses are much slower than register operations, so the first should be faster. Another thought: if your swap is inside a loop, unrolling the loop once might completely get rid of the need for the variable swap. – Willem Hengeveld Apr 03 '14 at 20:30
  • If the memory was in L1 cache, would the `xor` instruction be faster enough than mov to make up for the 2 extra cache hits in the second version? – J V Apr 03 '14 at 20:33
  • 1
    Your "half-optimized asm pseudocode" doesn't even work. It never stores a value to memory. I think the order of operands in your last two instructions is reversed (i.e. should be `mov dword ptr[rbp+8],eax`). In addition, it's unlikely that you can do 6 memory accesses faster than you can do 4. I strongly suspect that the first example, once you fix it, will be faster than the second. – Jim Mischel Apr 03 '14 at 20:37
  • Your first example doesn't actually work as written, but once fixed I am pretty sure that version will be the fastest of the two. – 500 - Internal Server Error Apr 03 '14 at 20:37
  • Could have sworn I changed that already – J V Apr 03 '14 at 20:39
  • Related: [Why is the XOR swap optimized into a normal swap using the MOV instruction?](https://stackoverflow.com/q/71382441) - GCC / clang will disentangle xor-swap in the source into just plain swap. – Peter Cordes Mar 07 '22 at 21:00

3 Answers3

3

Pulling it into two registers then writing back swapping the contents is likely the fastest of the solutions. Four memory cycles, four instructions, two registers. assuming the data has to start in and return to ram, then you cant beat this approach in general.

Four xors assuming you could do memory for sources and destinations, is three cycles per xor, 12 memory cycles, that is a definite loser. using registers to avoid two mem operands just adds more instructions.

Your asm pseudocode is 6 memory cycles. 6 instructions one register. The four cycles, four instructions two registers is likely cheaper. Now if you have to do two memory cycles to free up those registers it becomes 6 cycles. where this last one would be one additional to free up the register so 7. 6 is still cheaper than 7 and 5 instructions cheaper than 7, instruction size was not counted here but adds to memory cycles although fetching is likely done in an efficient manner (in good sized aligned chunks).

If the data were already in registers, then using a third register and doing the three instruction tmp = a, a = b, b = tmp is three operations three registers and the fastest. But if you simply cant spare a register then four xors is faster.

Thats all a generic high level view, there are likely processors and cache situations, etc that can make one solution that appears to be faster not end up being faster certainly for one test but perhaps in general depending on the situation.

old_timer
  • 69,149
  • 8
  • 89
  • 168
  • x86 has `xchg` instruction to swap values so you don't need another free temp register – phuclv Apr 04 '14 at 16:54
  • 1
    if it is all in registers then there you go that is the best way, if it has to start and end in ram then exchanging reg contents is just a wasted instruction. – old_timer Apr 04 '14 at 17:31
3

There is no reason why the Xor method would be faster, on any machine.

Both methods need to perform two reads and two writes, and the Xor method has ALU+memory overhead.

1

On processors supporting register move-elimination (e.g. - IvyBridge or later generation), the fastest way should be the first (using a temp variable) if you can make the compiler keep these values in registers (you'll have to check the generated assembly to make sure).

This way you avoid not only the memory accesses (although a read-after-write should get forwarded internally, but you still accumulate latencies in the memory unit), you also avoid execution latency. The CPU would simply switch the pointers of the registers themselves in the out-of-order register renamer.

Even without move elimination, register-only moves should be faster. The memory unit has tons of restrictions it has to enforce (collision checks, cache lookups, etc..), longer pipeline and less bandwidth the a regular execution.

Leeor
  • 19,260
  • 5
  • 56
  • 87