
I have simple C code that does this (pseudo code):

#define N 100000000   /* 100 MB */
int *DataSrc = (int *) malloc(N);
int *DataDest = (int *) malloc(N);
memset(DataSrc, 0, N);
for (int i = 0; i < 4; i++) {
    StartTimer();
    memcpy(DataDest, DataSrc, N);
    StopTimer();
}
printf("%d\n", DataDest[RandomInteger]);  /* keep the copies from being optimized away */

My PC: Intel Core i7-3930, with 4x4GB DDR3 1600 memory running RedHat 6.1 64-bit.

The first memcpy() runs at 1.9 GB/s, while the next three run at 6.2 GB/s. The buffer size (N) is far too big for this to be caused by cache effects. So, my first question:

  • Why is the first memcpy() so much slower? Maybe malloc() doesn't fully allocate the memory until you use it?

If I eliminate the memset(), the first memcpy() runs at about 1.5 GB/s, but the next three run at 11.8 GB/s, almost a 2x speedup. My second question:

  • Why is memcpy() 2x faster if I don't call memset()?
Isn't it UB if you memcpy from an uninitialized source? What compiler are you using with what optimizations? Make the timings more reliable by increasing the data size by 10x or more. – usr May 18 '14 at 14:53
  • 1
@usr The data will be random; there is no UB as long as you don't use the data in a way that could introduce UB. There is no code in the example that would do that. – this May 18 '14 at 15:00
  • I'm using gcc with -O2. – JB_User May 18 '14 at 15:00
@self. Sounds to me like reading an uninitialized variable, which alone is enough to trigger UB. – usr May 18 '14 at 15:01
BTW: 11.8 GB/s bus speed seems a bit too fast to me. – wildplasser May 18 '14 at 15:01
@usr Reading an uninitialized variable doesn't trigger UB; using that value incorrectly does. For example, using that value to access an array offset will trigger UB. I guess technically (per the standard) you are correct. – this May 18 '14 at 15:03
Besides: a "fresh" malloc() is probably built on mmap of /dev/zero, so the memory *will* be initialized. Setting it to all zeros will cause COW to actually clone and copy the pages anyway. – wildplasser May 18 '14 at 15:06
  • @self. I'm no expert on that but the top 3 hits disagree with you: https://www.google.com/webhp?complete=1&hl=en#complete=1&hl=en&q=read+uninitialized+variable+undefined+behavior Even if you are correct compilers still treat it as UB (hit 3). The Debian random number generator was crippled because someone thought it was a good idea to mix in an uninitialized variable. Was optimized out. – usr May 18 '14 at 15:07
  • @wildplasser the compiler is free to assume otherwise provided that this is in fact UB. – usr May 18 '14 at 15:07
That may be correct, but the OP specifically mentions gcc and Linux. Besides: there are no trap representations possible for ints (and the ints are never used, only copied). Otherwise reading random data from an unknown disk file could also cause problems. – wildplasser May 18 '14 at 15:08
    @usr Mr. Google is not an expert either. ;) – this May 18 '14 at 15:08
  • @self. hit 3 explicitly shows that GCC optimized out access to uninitialized variables. UB based benchmarks are very, very dodgy. – usr May 18 '14 at 15:09
Would be easy to verify: just check the generated assembly. – wildplasser May 18 '14 at 15:11

2 Answers


As others already pointed out, Linux uses an optimistic memory allocation strategy.

The difference between the first and the following memcpys is the initialization of DataDest.

As you have already seen, when you eliminate memset(DataSrc, 0, N), the first memcpy is even slower, because the pages for the source must be allocated as well. When you initialize both DataSrc and DataDest, e.g.

memset(DataSrc, 0, N);
memset(DataDest, 0, N);

all memcpys will run with roughly the same speed.

For the second question: when you initialize the allocated memory with memset, all pages will be laid out consecutively. On the other hand, when the memory is allocated as you copy, the source and destination pages will be allocated interleaved, which might make the difference.

Olaf Dietsche

This is most likely due to lazy allocation in your VM subsystem. Typically, when you allocate a large amount of memory, only the first few pages are actually allocated and wired to physical memory. When you access beyond those pages, page faults are generated and further pages are allocated and wired in on an on-demand basis.

As to the second part of the question, I believe some VM implementations actually track zeroed pages and handle them specially. Try initialising DataSrc to actual (e.g. random) values and repeat the test.

Paul R
+1 - 'Dirtying' (writing to) all the pages beforehand should indeed make things clear; one may try `calloc()`: http://stackoverflow.com/q/1538420/1175253 – Sam May 18 '14 at 15:10
  • 1
    @Sam: the top answer on that linked question was incorrect until I fixed it; `calloc` on most mainstream OSes gets zeroed pages from the kernel so they're still lazily allocated and will page-fault on read or write. – Peter Cordes Feb 19 '20 at 06:49