If we take the implementation from the Wikipedia article you link...
send(to, from, count)
register short *to, *from;
register count;
{
register n=(count+7)/8;
switch(count%8){
case 0: do{ *to = *from++;
case 7: *to = *from++;
case 6: *to = *from++;
case 5: *to = *from++;
case 4: *to = *from++;
case 3: *to = *from++;
case 2: *to = *from++;
case 1: *to = *from++;
}while(--n>0);
}
}
...and replace the "high-level" do
/ while
loop with the assembly-level if
/ goto
that the compiler really reduces it to...
send(to, from, count)
register short *to, *from;
register count;
{
register n=(count+7)/8;
switch(count%8){
case 0: do_label: *to = *from++;
case 7: *to = *from++;
case 6: *to = *from++;
case 5: *to = *from++;
case 4: *to = *from++;
case 3: *to = *from++;
case 2: *to = *from++;
case 1: *to = *from++;
if (--n>0) goto do_label;
}
}
...it may help you perceive that - in this usage where the do/while scope didn't introduce any local variables - there's really nothing more to the do-while than a jump back to case 0 that bypasses the switch and hence the need to evaluate count % 8 (% is a pretty expensive operation in the scheme of things).
Hope that helps it click, but may not...? :-)
Why did duff use 8? It could be 16, 65536 or whatever. Because of code size? Is there another reason? For example, cache or pipelining benefits.
Just a case of diminishing returns. Having to do the --n > 0
check and a jump after every 8 data copies isn't a big percentage overhead, but the size of the code (both in source and as compiled code in cache) is still pretty tight. Maybe it'd be 90 or 95% work vs overheads, which was evidently good enough. Further, to illustrate and share the concept with others Tom Duff may have prefered it to be about a typical 80x25 terminal's worth of code rather than a page or 10.