In the special case of a 4-byte string like your example, use rol dword ptr [arr], 8 to do the rotate you described.
(Remember that x86 is little-endian, so left-shifts inside a multi-byte operand move bytes to higher addresses.)
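To make the byte movement concrete, here is a minimal C sketch of what that rol does to a 4-byte buffer. The helper name rotate4_up_one is made up for illustration, and the result shown only holds on a little-endian machine such as x86.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: rotate a 4-byte buffer one position toward higher
   addresses, mirroring rol dword ptr [arr], 8 on little-endian x86. */
static void rotate4_up_one(unsigned char *arr)
{
    uint32_t v;
    assert(arr != NULL);
    memcpy(&v, arr, sizeof v);   /* load the 4 bytes; safe for any alignment */
    v = (v << 8) | (v >> 24);    /* rotate left by 8 bits */
    memcpy(arr, &v, sizeof v);   /* store the rotated dword back */
}
```

On a little-endian host the bytes 'a','b','c','d' become 'd','a','b','c': each byte moves up one address and the last byte wraps to the front.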
In the general case, just implement a memmove() with a normal copy loop to shift the bytes over, then store the byte that has to wrap around. (You might want to load the byte that wraps around before entering the copy loop, so you can overwrite the location where it was stored.)
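A rough C sketch of that idea, assuming the rotate goes toward higher addresses (the same direction as the rol above) and using the library memmove() rather than a hand-written loop. The function name rotate_up_one is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: rotate buf[0..n-1] by one byte toward higher
   addresses; the last byte wraps around to buf[0]. */
static void rotate_up_one(unsigned char *buf, size_t n)
{
    assert(buf != NULL);
    if (n < 2)
        return;                           /* nothing to rotate */
    unsigned char wrapped = buf[n - 1];   /* load the wrapping byte first */
    memmove(buf + 1, buf, n - 1);         /* overlapping shift up by one */
    buf[0] = wrapped;                     /* store the byte that wrapped */
}
```

Saving the last byte before the copy is what lets the shift overwrite its old location freely.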
The best way to do this (for performance) is probably with SSE movups. rep movsb has high startup overhead and is slower on misaligned data. It probably also doesn't work well when the destination overlaps the source, but I don't remember seeing that mentioned.
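As one possible shape for the SSE version, here is a sketch in C with intrinsics instead of hand-written asm. It uses unaligned 16-byte loads and stores (_mm_loadu_si128 / _mm_storeu_si128, the integer counterparts of movups) and walks from the high end of the buffer down, so each chunk is loaded before any store can overwrite it. The function name and the fallback to memmove() when SSE2 is not available are my own assumptions, and I haven't benchmarked it.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical helper: rotate buf[0..n-1] by one byte toward higher
   addresses, moving 16 bytes at a time where possible. */
static void rotate_up_one_sse(unsigned char *buf, size_t n)
{
    assert(buf != NULL);
    if (n < 2)
        return;
    unsigned char wrapped = buf[n - 1];  /* save before it is overwritten */
    size_t remaining = n - 1;            /* buf[0..remaining-1] still has to move up */
#if defined(__SSE2__)
    /* Copy from the top down so the overlapping store never clobbers
       source bytes that have not been loaded yet. */
    while (remaining >= 16) {
        size_t i = remaining - 16;
        __m128i v = _mm_loadu_si128((const __m128i *)(buf + i));
        _mm_storeu_si128((__m128i *)(buf + i + 1), v);
        remaining = i;
    }
#endif
    memmove(buf + 1, buf, remaining);    /* leftover bytes at the low end */
    buf[0] = wrapped;
}
```

For small buffers the plain memmove() version is likely just as good; the vector loop only starts to matter once there are many 16-byte chunks to move.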
If that wasn't what you meant by "best", be more specific and say "easiest to understand" or something.