Common wisdom is that rep movsb is much slower than rep movsd (or, on 64-bit, rep movsq) when performing identical operations. However, I've been testing on a couple of modern machines, and the run times are coming out identical (up to measurement noise) across a huge range of buffer sizes (10 bytes to 2 MB). So far I have tested on just 2 machines (a 32-bit Intel Atom D510 and a 64-bit AMD FX 8120).
Are there any modern x86 (32- or 64-bit) machines where rep movsb is slower than rep movsd (or rep movsq)? If not, what was the last machine where the difference was significant, and how significant was it?
I'm asking because I want to avoid cargo-culting a bunch of tests that break memory up into an unaligned head/tail and an aligned middle for the sake of using rep movsd or rep movsq if there's no actual benefit to doing this...
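
For concreteness, here is a minimal sketch of the kind of head/middle/tail split in question, written in portable C rather than assembly. The names split_for, copy_split and struct split are made up for illustration. The byte loops stand in for rep movsb and the 4-byte loop stands in for rep movsd. It aligns on the destination only, which is one common choice and not the only one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper type: sizes of the three pieces of an n-byte copy. */
struct split {
    size_t head;   /* single bytes copied until dst is 4-byte aligned */
    size_t words;  /* 4-byte units in the aligned middle */
    size_t tail;   /* leftover bytes after the last full word */
};

/* Work out how an n-byte copy to dst divides into head, middle and tail. */
struct split split_for(const void *dst, size_t n)
{
    struct split s;
    size_t misalign = (size_t)((uintptr_t)dst & 3u);

    s.head = misalign ? 4u - misalign : 0u;
    if (s.head > n)              /* tiny copy: everything fits in the head */
        s.head = n;
    s.words = (n - s.head) / 4u;
    s.tail = (n - s.head) % 4u;
    return s;
}

/* Copy n bytes using the split above; the buffers must not overlap. */
void *copy_split(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    struct split sp;
    size_t i;

    assert(dst != NULL && src != NULL);
    sp = split_for(dst, n);

    for (i = 0; i < sp.head; i++)        /* stands in for rep movsb */
        *d++ = *s++;
    for (i = 0; i < sp.words; i++) {     /* stands in for rep movsd */
        uint32_t w;
        memcpy(&w, s, sizeof w);         /* src may still be unaligned */
        memcpy(d, &w, sizeof w);
        d += 4;
        s += 4;
    }
    for (i = 0; i < sp.tail; i++)        /* stands in for rep movsb */
        *d++ = *s++;
    return dst;
}
```

The extra branches and three separate loops are exactly the overhead I'd like to drop if a single rep movsb over the whole buffer is just as fast.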