3

Possible Duplicate:
What does the code do?

void duff(register char *to, register char *from, register int count)
{
    register int n=(count+7)/8;
    switch(count%8)
    {
        case 0: do{ *to++ = *from++;
        case 7:  *to++ = *from++;
        case 6: *to++ = *from++;
        case 5: *to++ = *from++;
        case 4: *to++ = *from++;
        case 3: *to++ = *from++;
        case 2: *to++ = *from++;
        case 1: *to++ = *from++;
        }while( --n >0);
    }
}

Is the above valid C code? If so, what is it trying to achieve and why would anyone do something like the above?

Community
  • 1
  • 1
josh
  • 13,793
  • 12
  • 49
  • 58
  • Better make `count` unsigned, or add `7U` instead of `7`. Otherwise the divide will be slow. – R.. GitHub STOP HELPING ICE Aug 02 '10 at 05:58
  • 1
    absolutely definite duplicate of [What does the code do?](http://stackoverflow.com/questions/1723270/what-does-the-code-do) – paxdiablo Aug 02 '10 at 06:04
  • 1
    @R..: Uh, no. You really think a compiler is too dumb to "optimize" integer division? – GManNickG Aug 02 '10 at 06:17
  • No, I think GMan is too dumb to realize that `(-2)%8` is -2 while `(-2)&7` is 6. Division/modulo by a power of 2 **cannot** be optimized to bitshift/bitwise-and when the operand might be negative, due to C's stupidly-specified behavior for division of negative numbers (which conflicts with the way most number-theorists and algabrists define division in a ring). Thus you always need to either write the bit operations yourself, or use unsigned operands. – R.. GitHub STOP HELPING ICE Aug 03 '10 at 05:45
  • 1
    @R: You severely underestimate your compiler (no difference for me on VS2010). We stopped hand-optimizing bit operations long ago. – GManNickG Aug 04 '10 at 05:36

3 Answers3

9

It's called Duff's device and you can read about it on wikipedia.

It takes care of one problem with an unrolled loop: there could be a non-integer number of passes needed. One method is to deal with this outside the main loop, but it's more efficient to use Duff's device which uses a very fast jump table and avoids extra looping overhead dealing with the odd number of operations.

In your example, which is a memory copy, please compare to the naive version:

void memcpy(char* dst, char* src, size_t count)
{
   begin:
     if (count-- == 0) return;
     *(dst++) = *(src++);
     goto begin;
}

To copy 15 bytes, this does the following:

test count, copy, loop, test count, copy, loop, test count, copy, loop, test count, copy, loop, test count, copy, loop, test count, copy, loop, test count, copy, loop, test count, copy, loop, test count, copy, loop, test count, copy, loop, test count, copy, loop, test count, copy, loop, test count, copy, loop, test count, copy, loop, test count, copy, loop, test count

Note how many times the "test count" and "loop" operations must be done.

Using duff's version which you showed, it is much simpler:

jump based on count, copy, copy, copy, copy, copy, copy, copy, test count, loop, copy, copy, copy, copy, copy, copy, copy, copy, test count

which saves over half the steps

Ben Voigt
  • 277,958
  • 43
  • 419
  • 720
5

It's valid. It's a very old-school loop unroll.

Basically, instead of checking the count for every character that's being copied to determine when to stop, it only has to check ceil(n/8) times.

The register keyword is just a compiler hint to suggest that the compiler try to keep that value in a register instead of shuffling it in and out of main memory.

Obviously stuff like this isn't really necessary anymore (memcpy() is likely to have a very fast implementation on whatever machine you're coding for) but tricks like this used to actually provide pretty decent performance wins.

fastcall
  • 2,564
  • 4
  • 20
  • 21
  • Using this implementation for `memcpy` is just silly, since the library will optimize by copying multiple bytes at once, instead of many one byte copies, but for other repetitive operations in which looping overhead is a lot of the run time, it can be a big improvement. For example, adding a constant to every element in an array. The addition is very fast, the looping logic actually is the biggest cost, so unrolling speeds things up a lot. – Ben Voigt Aug 02 '10 at 05:44
  • Duff's device might still win for very small memcpy operations that are rarely aligned, for example moving around small segments of strings. – R.. GitHub STOP HELPING ICE Aug 02 '10 at 05:55
  • 1
    +1 for stating it's unnecessary. The problem nowadays is far more "unreadable crappy unmaintainable code" than "slow code" :-) – paxdiablo Aug 02 '10 at 06:03
  • @R: You might use a Duff's device variant with no loop to take care of the preamble and postamble portions, but double or quad-word copies are still going to be preferred for the middle of the copy. Even unaligned double-word copy will be faster than individual byte copies on many processors. – Ben Voigt Aug 02 '10 at 06:06
  • My point was that special-casing the "middle of the copy" (which requires a branch or two) might be a net loss if most of your copies are so small and alignment is so random that there usually is no "middle". Obviously the duff's device approach is not the best for a general-purpose memcpy, but I can still envision rare special applications where it may be best. – R.. GitHub STOP HELPING ICE Aug 03 '10 at 05:51
  • @paxdiablo: I find that nowadays, "unreadable crappy unmaintainable code" and "slow code" are usually one in the same. – R.. GitHub STOP HELPING ICE Aug 03 '10 at 10:28
2

Obligatory link to Duff's Device on Wikipedia.

Bob Somers
  • 7,266
  • 5
  • 42
  • 46