I want to optimize my function as much as possible, and one of the things I did was use r8 as the pointer, because that's the register the pointer is passed in for x64 functions.
But would it be faster to push RSI or RDI, move the pointer into it, and use that register in the loop instead?
For example:
mov [RSI],DL ; would compile to 2 bytes
and:
mov [r8],DL ; would compile to 3 bytes
So if the loop runs 100 to 200 times, would r8 be slower because of the extra byte to decode? Or does pushing RSI and moving the pointer eliminate any possible speed increase? Obviously the push and mov would happen outside the loop.
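To make the comparison concrete, here is a rough sketch of the two versions I have in mind. It assumes the Microsoft x64 calling convention (where RSI is callee-saved, hence the push/pop), an iteration count already in ecx, and the byte to store in DL. The label names and the loop body are only illustrative, not my real code.

```asm
; Variant A: use r8 directly as the pointer
loop_a:
    mov  [r8], dl       ; encodes as 41 88 10 (3 bytes: REX prefix + opcode + ModRM)
    inc  r8             ; advance the pointer
    dec  ecx            ; assumed loop counter
    jnz  loop_a

; Variant B: copy the pointer into rsi once, outside the loop
    push rsi            ; rsi must be preserved (assumption: Microsoft x64 ABI)
    mov  rsi, r8        ; one-time copy of the pointer
loop_b:
    mov  [rsi], dl      ; encodes as 88 16 (2 bytes: opcode + ModRM, no REX)
    inc  rsi            ; advance the pointer
    dec  ecx            ; assumed loop counter
    jnz  loop_b
    pop  rsi            ; restore the caller's rsi
```

The only difference inside the loop is the one-byte REX prefix on the store. Variant B pays for saving that byte with an extra push, mov and pop outside the loop.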