Why is a second call to memset much faster than the first one?

Question

I guess it's because of some kind of cache, but I keep wondering.

When memset is called a second time on the same memory, the code runs much faster.

Let's take this piece of code:

> for_stackoverflow.c

#include <string.h>
#include <time.h>
#include <stdlib.h>
#include <stdio.h>

int main(void)
{
  long long size = 1 << 29;   /* 512 MiB */
  char * mem;
  clock_t start, end;
  double how_long;

  mem = malloc(size);
  if (mem == NULL) {          /* a 512 MiB request can fail */
    perror("malloc");
    return 1;
  }

  start = clock(); {
    memset(mem, 0xde, size);
  } end = clock();

  how_long = ((double) (end - start)) / CLOCKS_PER_SEC;
  printf("%f\n", how_long);
  
  start = clock(); {
    memset(mem, 0xad, size);   
  } end = clock();

  how_long = ((double) (end - start)) / CLOCKS_PER_SEC;
  printf("%f\n", how_long);
  
  free(mem);

  return 0;
}

and compile it without any optimisation (otherwise memset gets skipped; I think so because it does not appear in the output of objdump -dw a.out):

$ gcc -O0 for_stackoverflow.c

The result on my machine (GNU/Linux, Ubuntu 20.04, Intel i5) is:

$ ./a.out 
0.244824
0.044119

...and the second memset ran lightning fast compared to the first one.

I can get some curious results. For example, halving the size of the first memset makes the two timings even:

    memset(mem, 0xde, size); -> memset(mem, 0xde, size/2);
    memset(mem, 0xad, size); -> memset(mem, 0xad, size);

$ ./a.out
0.145827
0.141934

This makes me think that writing to memory a second time is much faster than writing to it the first time.

I am confused about what kind of optimisation occurs in this example. Does anyone have an idea of what's happening?

Help would be much appreciated.
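
One way to test this guess is to write one byte into every page of the buffer before the timed regions. This is only a sketch, and it assumes a Linux-like system where the pages of a large `malloc` buffer are not backed by physical memory until they are first written. If the first touch of each page is what separates the two timings, both memset calls should then take about the same time. `prefault` is a hypothetical helper name, not part of the program above.

```c
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper: write one byte per page so that every page of the
 * buffer is touched once before any timing starts.
 * Returns the number of page-sized steps taken. */
size_t prefault(char *mem, size_t size)
{
    long page = sysconf(_SC_PAGESIZE);
    size_t step = page > 0 ? (size_t)page : 4096;  /* assume 4 KiB if unknown */
    volatile char *p = mem;  /* volatile keeps these stores from being dropped */
    size_t steps = 0;

    for (size_t off = 0; off < size; off += step) {
        p[off] = 0;
        steps++;
    }
    /* malloc does not promise page alignment, so the last byte may sit
     * in a page the loop did not reach. */
    if (size > 0)
        p[size - 1] = 0;
    return steps;
}
```

Calling `prefault(mem, size);` right after `malloc` and before the first `clock()` moves the first-touch cost out of both timed regions, so any gap that remains would have to come from something else.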

Page faults, cache, TLB, and also CPU frequency warm-up. Maybe even lazy dynamic linking overhead on the very first call! But as you showed with your 2nd test, page faults from lazy allocation make the first touch of every page a lot more expensive, so that dominates the cost. — Peter Cordes, Nov 27 '20 at 15:49
This is one of the rare cases where disabling optimization (`-O0`) doesn't really matter; the vast majority of the work happens in a hand-written asm library function. — Peter Cordes, Nov 27 '20 at 15:52
@PeterCordes It could even be counter productive to use `-O3` since the compiler may optimize out the call to `memset()`. Edit: it is, time reduces to a few µs for both parts with `-O1`. — 12431234123412341234123, Nov 27 '20 at 15:58
@12431234123412341234123 - Yeah, GCC understands that `malloc` returns memory that nothing else points to, so it can prove that the stores are "dead" and optimize them away. `gcc -O3 -fno-builtin-malloc` might generate the function-calls you want to benchmark with more efficient calls in between, if the first memset didn't still get treated as dead stores. (Probably not because this program makes some function calls between the two timed regions.) — Peter Cordes, Nov 27 '20 at 16:12
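
The two points raised in these comments can be checked with a rough sketch that counts minor page faults around each fill. It assumes Linux, where `getrusage` fills in the `ru_minflt` field. The names `opaque_memset`, `minor_faults`, `faults_during_fill` and `report_fills` are hypothetical. `memset` is called through a volatile function pointer so that the compiler has to emit a real call at any optimization level; this is a different technique from the `-fno-builtin-malloc` flag mentioned above.

```c
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/resource.h>

/* The compiler must reload this pointer on every call, so it cannot treat
 * the call as the memset builtin or discard the stores as dead. */
static void *(*volatile opaque_memset)(void *, int, size_t) = memset;

/* Minor page faults this process has taken so far, or -1 on error. */
long minor_faults(void)
{
    struct rusage usage;
    if (getrusage(RUSAGE_SELF, &usage) != 0)
        return -1;
    return usage.ru_minflt;
}

/* Fill the buffer and return how many minor faults occurred meanwhile. */
long faults_during_fill(void *mem, int value, size_t size)
{
    long before = minor_faults();
    opaque_memset(mem, value, size);
    return minor_faults() - before;
}

/* Print the fault counts of two consecutive fills of a fresh buffer.
 * Returns 0 on success, -1 if the allocation fails. */
int report_fills(size_t size)
{
    char *mem = malloc(size);
    if (mem == NULL)
        return -1;
    printf("first fill:  %ld minor faults\n", faults_during_fill(mem, 0xde, size));
    printf("second fill: %ld minor faults\n", faults_during_fill(mem, 0xad, size));
    free(mem);
    return 0;
}
```

Calling `report_fills((size_t)1 << 29)` from `main` mirrors the buffer size in the question. With 4 KiB pages that buffer spans 131072 pages, so a first-fill count near that figure and a second-fill count near zero would be consistent with the lazy-allocation explanation. Transparent huge pages, if enabled, can make the first count much lower.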
