0

I am working through "Data Structures and Program Design in C" by Kruse, Leung, and Tondo. Chapter 2, Section 2 presents a simple data structure (a list) and later asks the reader to write two different copy functions for the structure. The exercise in question is E2 and reads as follows:

Write functions that will copy one list to another list (as the structure type defined in the text). Use the following methods: (a) copy the entire structures; (b) use a loop to copy only the entries. Which version is easier to write? Which version will usually run faster, and why?

My question comes in with the last part of this exercise: "Which version will usually run faster, and why?". I am aware that copying the entire structure is faster, and my understanding suggests that this is because one can avoid the overhead of a loop. However, when I ran each I was surprised to find that copying the entire structure was not only faster but roughly 10x faster than copying each entry.

I was hoping someone may be able to explain to me why this is, or at least direct me to a source that could aid in my understanding. I tried reading through the assembly of the two functions I wrote, but my understanding of assembly is very basic at this point.

Thank you!

Relevant code:

#define MAXLIST 200 //maximum size of lists

extern void Error(const char *);

typedef struct coord_tag {
    int row;    //x
    int col;    //y
} Coord_type;

typedef Coord_type Entry_type;

typedef struct list_tag {
    int count;
    Entry_type entry[MAXLIST];
} List_type;

void Copy(List_type *lt, List_type *lf) //list "to" and list "from"
{
    if (lt == NULL || lf == NULL) {
        Error("list uninitialized");
    } else {
        *lt = *lf;
    }
}

void Copy2(List_type *lt, List_type *lf)
{
    if (lt == NULL || lf == NULL) {
        Error("list uninitialized");
    } else {
        int i;

        lt->count = lf->count;
        for (i = 0; i < lf->count; i++) {
            lt->entry[i] = lf->entry[i];
        }
    }
}
ikegami
  • 367,544
  • 15
  • 269
  • 518
upgrayedd
  • 129
  • 11
  • 1
    Look at the generated assembly. – ikegami Sep 01 '14 at 18:47
  • 1
    If you want help understanding the assembly, you should post it ;) – mafso Sep 01 '14 at 18:51
  • 1
    How was this compiled? – mafso Sep 01 '14 at 18:55
  • try comparing it with memcpy, the first copy is equivalent to memcpy. When there is a for-loop compiler may not know that those entries are for sure in a consecutive memory location and has to go through for loop, but there are assembly instructions for whole-sale memory copy. – dashesy Sep 01 '14 at 19:04
  • Two suggestions: 1. rerun your test with different levels of optimization (on non-Windows systems, that's usually achieved with the `-O1`, `-O2`, `-O3`, or `-Os` options). 2. To get a deeper understanding of what happens under the hood, put `Copy2()` into a source file of its own and compile with the compiler flags `-Os -S`. This will give you a file with a ".s" filename extension containing the assembly code. The optimization flag is not strictly necessary, but in my experience, it increases the readability of the assembly code quite significantly. – cmaster - reinstate monica Sep 01 '14 at 19:18
  • try Copy2 with `restrict` keyword. – Jonatan Goebel Sep 01 '14 at 20:42

1 Answers1

1

You'd be surprised at how fast a straight memory copy can be! In assembly, there are instructions dedicated to fast memory copy. (e.g. REP MOVSB) Let's look at all the new interruptions the second copy introduces in each loop iteration:

  • i++
    • caches the original value of i
    • increments i in memory
    • finally returns the original value of i
  • lf->entry[i]
    • retrieves value of lf
    • retrieves value of i
    • adds i plus the offset of entry to lf
    • retrieves value at that address
  • lt->entry[i]
    • retrieves value of lt
    • retrieves value of i
    • adds i plus the offset of entry to lt
  • i < lf->count
    • retrieves value of lf
    • retrieves value of lf->count
    • retrieves value of i
    • compares i to lf->count

You can imagine why this would be 10x slower than an uninterrupted memory copy.

ldgabbay
  • 1,133
  • 8
  • 9
  • If any optimization is applied, the values of `i`, `lt`, and `lf` should always reside in registers. So the operation of "retrieves value of ..." is a noop. I don't think, the OP applied optimization, though... – cmaster - reinstate monica Sep 01 '14 at 19:13
  • 1
    I think it depends on your compiler. Nowadays with gcc, you should set `__restrict` on the arguments to the function so the compiler knows it doesn't have to check the value of the arguments every time. http://stackoverflow.com/questions/1965487/does-the-restrict-keyword-provide-significant-benefits-in-gcc-g – ldgabbay Sep 01 '14 at 19:25
  • All three variables that I mentioned are local variables in the function. The pointers are passed by value, so they never need to be reloaded. – cmaster - reinstate monica Sep 01 '14 at 19:34
  • No, you're right that it can keep i, lt, and lf in registers. I'm just saying the __restrict keyword on lt and lf could tell the optimizer that it can assume that the data behind those pointers won't change behind the scenes during the call. This allows the optimizer to cache lf->count, for example. – ldgabbay Sep 01 '14 at 20:04
  • @ldgabbay Thank you, this is exactly what I was looking for. I was trying to break down the steps in my mind at machine-level, but didn't bring it far enough to realize all of the operations that would be involved in the for loop. – upgrayedd Sep 01 '14 at 21:12
  • @upgrayedd I'll give you a couple of tips I use, although many compilers do this for you. First, if you don't care about the returned result, use `++i` over `i++`. Second, if you can use pointers to iterate through lists, do it over array indexing. a single `*(++ptr)` is faster than `arr[++i]` because of the extra pointer addition of `arr + i` that happens under the covers. Third, cache the loop stopping value. `lf->count` has to be pulled from RAM, so put it in a local variable, which will go into a register. Fourth, `i != length` is often slightly faster than `i < length`. – ldgabbay Sep 02 '14 at 16:44