
I want to measure memory bandwidth using memcpy. I modified the code from this answer: why vectorizing the loop does not have performance improvement, which used memset to measure the bandwidth. The problem is that memcpy is only slightly slower than memset, when I expect it to be about two times slower since it operates on twice the memory.

More specifically, I run over 1 GB arrays a and b (allocated with calloc) 100 times with the following operations.

operation             time(s)
-----------------------------
memset(a,0xff,LEN)    3.7
memcpy(a,b,LEN)       3.9
a[j] += b[j]          9.4
memcpy(a,b,LEN)       3.8

Notice that memcpy is only slightly slower than memset. The operation a[j] += b[j] (where j goes over [0,LEN)) should take three times longer than memset because it operates on three times as much data. However, it's only about 2.5 times as slow as memset.

Then I initialized b to zero with memset(b,0,LEN) and tested again:

operation             time(s)
-----------------------------
memcpy(a,b,LEN)       8.2
a[j] += b[j]          11.5

Now we see that memcpy is about twice as slow as memset, and a[j] += b[j] is about three times as slow as memset, as I expect.

At the very least I would have expected memcpy to be slower even before memset(b,0,LEN), because of lazy allocation (first touch) on the first of the 100 iterations.

Why do I only get the time I expect after memset(b,0,LEN)?

test.c

#include <time.h>
#include <string.h>
#include <stdio.h>

void tests(char *a, char *b, const int LEN){
    clock_t time0, time1;
    time0 = clock();
    for (int i = 0; i < 100; i++) memset(a,0xff,LEN);
    time1 = clock();
    printf("%f\n", (double)(time1 - time0) / CLOCKS_PER_SEC);

    time0 = clock();
    for (int i = 0; i < 100; i++) memcpy(a,b,LEN);
    time1 = clock();
    printf("%f\n", (double)(time1 - time0) / CLOCKS_PER_SEC);

    time0 = clock();
    for (int i = 0; i < 100; i++) for(int j=0; j<LEN; j++) a[j] += b[j];
    time1 = clock();
    printf("%f\n", (double)(time1 - time0) / CLOCKS_PER_SEC);

    time0 = clock();
    for (int i = 0; i < 100; i++) memcpy(a,b,LEN);
    time1 = clock();
    printf("%f\n", (double)(time1 - time0) / CLOCKS_PER_SEC);

    memset(b,0,LEN);
    time0 = clock();
    for (int i = 0; i < 100; i++) memcpy(a,b,LEN);
    time1 = clock();
    printf("%f\n", (double)(time1 - time0) / CLOCKS_PER_SEC);

    time0 = clock();
    for (int i = 0; i < 100; i++) for(int j=0; j<LEN; j++) a[j] += b[j];
    time1 = clock();
    printf("%f\n", (double)(time1 - time0) / CLOCKS_PER_SEC);
}

main.c

#include <stdlib.h>

void tests(char *a, char *b, const int LEN);

int main(void) {
    const int LEN = 1 << 30;    //  1GB
    char *a = (char*)calloc(LEN,1);
    char *b = (char*)calloc(LEN,1);
    tests(a, b, LEN);
}

Compile with gcc -O3 test.c main.c (GCC 6.2). Clang 3.8 gives essentially the same result.

Test system: i7-6700HQ@2.60GHz (Skylake), 32 GB DDR4, Ubuntu 16.10. On my Haswell system the bandwidths make sense before memset(b,0,LEN) i.e. I only see a problem on my Skylake system.

I first discovered this issue from the a[j] += b[k] operations in this answer, which was overestimating the bandwidth.


I came up with a simpler test

#include <time.h>
#include <string.h>
#include <stdio.h>

void __attribute__ ((noinline))  foo(char *a, char *b, const int LEN) {
  for (int i = 0; i < 100; i++) for(int j=0; j<LEN; j++) a[j] += b[j];
}

void tests(char *a, char *b, const int LEN) {
    foo(a, b, LEN);
    memset(b,0,LEN);
    foo(a, b, LEN);
}

This outputs:

9.472976
12.728426

However, if I do memset(b,1,LEN) in main after calloc (see below) then it outputs:

12.5
12.5

This leads me to think this is an OS allocation issue and not a compiler issue.

#include <stdlib.h>

void tests(char *a, char *b, const int LEN);

int main(void) {
    const int LEN = 1 << 30;    //  1GB
    char *a = (char*)calloc(LEN,1);
    char *b = (char*)calloc(LEN,1);
    //GCC optimizes memset(b,0,LEN) away after calloc but Clang does not.
    memset(b,1,LEN);
    tests(a, b, LEN);
}
Z boson
    There are many things that can impact this under the hood. For example, unless you ensure that your allocations are properly aligned, intrinsics may or may not be used, resulting in variations in timings with no other changes in code. If you really want to pursue this, I think you'd be best served analyzing the assembly produced rather than looking at it at the C level. – David Hoelzer Apr 02 '17 at 13:06
  • @DavidHoelzer, you're right I should have looked at the assembly. I don't know why I did not. I usually do that. I just tried `memset` in main.c (separate object file) and it makes no difference. This says it must be a compiler issue and not a OS allocation issue. BTW, in my original tests where I found this (not in this question) the arrays were required to be 32 byte aligned. – Z boson Apr 02 '17 at 13:12
  • @DavidHoelzer, a quick look at the assembly and I can't see how `memset(b,0,LEN)` makes so much of a difference. Here is a simple version https://godbolt.org/g/z6EM2b. I tested this simple version and it's still too fast before `memset`. – Z boson Apr 02 '17 at 13:31
  • @DavidHoelzer, I take it back. This does appear to be a OS allocation issue. If I use `memset(b,1,LEN);` in main.c (write 1 instead of 0) then I get the times I expect. GCC optimizes `memset(b,0,LEN);` after `calloc` away (but Clang does not). – Z boson Apr 02 '17 at 13:37
  • `Then I initialized b to zero with memset(b,0,LEN) and test again:` If the memory was uninitialized before (but obtained freshly via malloc), it will probably have been mapped to `/dev/zero` (expecting to be COWed later). And /dev/zero is very fast... and it will generate fewer cache misses. The best way to find out is by monitoring the RSS during the process – wildplasser Apr 02 '17 at 14:15
  • @wildplasser, they were allocated with `calloc` not `malloc` so they are supposed to be initialized to zero. I understand that `calloc` uses lazy clearing, though I'm not sure what it means in this case. – Z boson Apr 02 '17 at 15:23
  • @wildplasser, the RSS before `memset(b,0,LEN)` is 1049668 and after 2098364. – Z boson Apr 02 '17 at 15:31
  • @wildplasser, could you write up an answer with more details than in your comment? Keep in mind that I used `calloc`. – Z boson Apr 02 '17 at 15:33
  • Looks caused by Copy-On-Write as @wildplasser mentioned. If you do in main: `memset(b, 1, fill_len);` and run the program a few times varying `fill_len` between `0` and `LEN (=1GiB)`, the time spent in the `memcpy` afterwards will increase as `fill_len` is bigger. – nnn Apr 02 '17 at 15:41
  • For *new* allocations (not from the free list) there is no difference between calloc and malloc, **IF** COW from /dev/zero is used. Not sure if brk/sbrk will also map /dev/zero in; (I guess it will) – wildplasser Apr 02 '17 at 15:52
  • Err... where is the disassembly? If I was a compiler spotting first `memset(a,0xff,LEN);` and then later `memcpy(a,b,LEN);`, I'd omit the former entirely. What happens if you change the call order of memset and memcpy? – Lundin Apr 03 '17 at 07:46
  • @Lundin, you can see the disassembly for a special case in the third comment above. It makes no significant difference if I swap the order (BTW I posted the working code so you can test it if you like). The reason is that `b` is COW i.e. it just points to a page of zeros until it's written to. So `memset(b,0xff,LEN);` makes a big difference but `memset(a,0xff,LEN);` does not because `memcpy(a,b,LEN)` reads from `b` but does not write to it. – Z boson Apr 03 '17 at 08:00
  • Hmm, well I would have thought the compiler would optimize away the call entirely. Out of curiosity, does `char * restrict a, char * restrict b` change anything? – Lundin Apr 03 '17 at 08:06
  • `restrict` makes now difference. I see your point about the compiler optimizing away the `memset` before `memcpy`. Neither GCC nor Clang does that, and I don't know why. GCC does optimize `memset(0)` right after `calloc` away but Clang does not. – Z boson Apr 03 '17 at 08:11
  • @Lundin, I meant to write "no difference". – Z boson Apr 03 '17 at 11:29

2 Answers


The point is that malloc and calloc on most platforms don't allocate memory; they allocate address space.

malloc etc. work like this:

  • if the request can be fulfilled from the freelist, carve a chunk out of it
    • in the case of calloc: the equivalent of memset(ptr, 0, size) is issued
  • if not: ask the OS to extend the address space.

For systems with demand paging (COW) (an MMU could help here), the second option winds down to:

  • create enough page table entries for the request, and fill them with a (COW) reference to /dev/zero
  • add these PTEs to the address space of the process

This consumes no physical memory, except for the page tables.

  • Once the new memory is referenced for read, the read will come from /dev/zero. The /dev/zero device is a very special device, in this case mapped to every page of the new memory.
  • but, if the new page is written, the COW logic kicks in (via a page fault):
    • physical memory is allocated
    • the /dev/zero page is copied to the new page
    • the new page is detached from the mother page
    • and the calling process can finally do the update which started all this
wildplasser
  • I edited your answer to clean up some typos and added some links and formatting. I hope you don't mind. – Z boson Apr 03 '17 at 07:37
  • So I understand this now. Thanks. This is an optimization (which is the whole point of COW). If the memory is zero there is no need to waste space, and it's also faster to read from a single zero page than from several. It's interesting that GCC in this case converts `malloc` to `calloc` (but Clang does not) and a `memset(0)` write after `malloc` is ignored. So the code gets the right answer unless the answer you are looking for is without the optimization. In general I should write random data to arrays and read that. – Z boson Apr 03 '17 at 07:41
  • Is it correct to say that calloc generates PTEs (in my case 1GB/4KB PTEs = 262144 PTEs) and each PTE points to the same page which contains zeros? – Z boson Apr 03 '17 at 07:46
  • It's long past due that I read this https://www.akkadia.org/drepper/cpumemory.pdf – Z boson Apr 03 '17 at 07:49
  • Each page will point to the same zero-filled memory page, and all will (initially) have COW status. And calloc() doesn't do this itself; it only calls mmap() or sbrk(), and the OS does the dirty work. – wildplasser Apr 03 '17 at 07:52
  • @wildplasser, "Calloc() doesn't do this, it only calls mmap() or sbrk" - which calloc? This one from Linux glibc does more work: https://github.com/bminor/glibc/blob/73dfd088936b9237599e4ab737c7ae2ea7d710e1/malloc/malloc.c#L3170 and **sometimes** it will skip calling memset: https://github.com/bminor/glibc/blob/73dfd088936b9237599e4ab737c7ae2ea7d710e1/malloc/malloc.c#L3258 "Two optional cases in which clearing not necessary" (if the memory was `mmap`ed, not split from the arena's top heap). @Zboson, are there docs for the "`malloc+memset(0)`" to "`calloc`" optimization in GCC? – osgx Apr 03 '17 at 08:57
  • @osgx, I only know what I observe. [GCC drops the memset but Clang does not](https://godbolt.org/g/HDaxoo). And [here](https://godbolt.org/g/Wn8gYT) you can see that GCC converts malloc+memset to calloc. I said that GCC converts malloc to calloc even without memset, but I don't see that now, so I don't have evidence for that to show. – Z boson Apr 03 '17 at 10:10
  • @Zboson, probably it is a variant after "simplify_malloc_memset" from around 2014 https://gcc.gnu.org/ml/gcc-patches/2014-03/msg00076.html "*Re: calloc = malloc + memset*" (2013 gcc bug #57742 for 4.9/5.0 gcc); and gcc bug 67618 "*one case where this optimization really is invalid: when you're compiling an implementation of `calloc()`*". Code is in `gcc/tree-ssa-strlen.c`:`handle_builtin_memset` https://github.com/gcc-mirror/gcc/blob/7a31ada4c400351a35ab65f8dc0357e7c88805d5/gcc/tree-ssa-strlen.c#L1889 (near update_gimple_call). Andi Kleen: "*... will break a large variety of micro benchmarks.*" – osgx Apr 03 '17 at 10:34
  • @osgx, "will break a large variety of micro benchmarks." That's basically what happened to me! But I don't think there is anything wrong with this. I mean a compiler should not avoid an optimization because it will break a benchmark. My test was quick and dirty. I wrote a bandwidth testing app 2 years ago and tested it and it got it correct anyway. – Z boson Apr 03 '17 at 11:04

Your b array probably was not written to after mmap-ing (huge allocation requests with malloc/calloc are usually converted into mmap). The whole array was mapped to a single read-only "zero page" (part of the COW mechanism). Reading zeroes from a single page is faster than reading from many pages, since that single page stays in the cache and in the TLB. This explains why the test before memset(0) was faster:

This outputs: 9.472976 12.728426

However, if I do memset(b,1,LEN) in main after calloc (see below) then it outputs: 12.5 12.5

And more about GCC's malloc+memset / calloc+memset optimization into calloc (expanded from my comment):

//GCC optimizes memset(b,0,LEN) away after calloc but Clang does not.

This optimization was proposed in https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57742 (tree-optimization PR57742) on 2013-06-27 by Marc Glisse (https://stackoverflow.com/users/1918193), planned for GCC 4.9/5.0:

memset(malloc(n),0,n) -> calloc(n,1)

calloc can sometimes be significantly faster than malloc+bzero because it has special knowledge that some memory is already zero. When other optimizations simplify some code to malloc+memset(0), it would thus be nice to replace it with calloc. Sadly, I don't think there is a way to do a similar optimization in C++ with new, which is where such code most easily appears (creating std::vector(10000) for instance). And there would also be the complication there that the size of the memset would be a bit smaller than that of the malloc (using calloc would still be fine, but it gets harder to know if it is an improvement).

Implemented on 2014-06-24 (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57742#c15) - https://gcc.gnu.org/viewcvs/gcc?view=revision&revision=211956 (also https://patchwork.ozlabs.org/patch/325357/)

  • tree-ssa-strlen.c ... (handle_builtin_malloc, handle_builtin_memset): New functions.

The current code is in gcc/tree-ssa-strlen.c https://github.com/gcc-mirror/gcc/blob/7a31ada4c400351a35ab65f8dc0357e7c88805d5/gcc/tree-ssa-strlen.c#L1889 - if memset(0) gets a pointer from malloc or calloc, it converts the malloc into calloc and the memset(0) is then removed:

/* Handle a call to memset.
   After a call to calloc, memset(,0,) is unnecessary.
   memset(malloc(n),0,n) is calloc(n,1).  */
static bool
handle_builtin_memset (gimple_stmt_iterator *gsi)
 ...
  if (code1 == BUILT_IN_CALLOC)
    /* Not touching stmt1 */ ;
  else if (code1 == BUILT_IN_MALLOC
       && operand_equal_p (gimple_call_arg (stmt1, 0), size, 0))
    {
      gimple_stmt_iterator gsi1 = gsi_for_stmt (stmt1);
      update_gimple_call (&gsi1, builtin_decl_implicit (BUILT_IN_CALLOC), 2,
              size, build_one_cst (size_type_node));
      si1->length = build_int_cst (size_type_node, 0);
      si1->stmt = gsi_stmt (gsi1);
    }

This was discussed on the gcc-patches mailing list from Mar 1, 2014 to Jul 15, 2014 under the subject "calloc = malloc + memset",

with notable comment from Andi Kleen (http://halobates.de/blog/, https://github.com/andikleen): https://gcc.gnu.org/ml/gcc-patches/2014-06/msg01818.html

FWIW i believe the transformation will break a large variety of micro benchmarks.

calloc internally knows that memory fresh from the OS is zeroed. But the memory may not be faulted in yet.

memset always faults in the memory.

So if you have some test like

   buf = malloc(...)
   memset(buf, ...) 
   start = get_time();
   ... do something with buf
   end = get_time()

Now the times will be completely off because the measured times includes the page faults.

Marc replied "Good point. I guess working around compiler optimizations is part of the game for micro benchmarks, and their authors would be disappointed if the compiler didn't mess it up regularly in new and entertaining ways ;-)" and Andi asked: "I would prefer to not do it. I'm not sure it has a lot of benefit. If you want to keep it please make sure there is an easy way to turn it off."

Marc shows how to turn this optimization off: https://gcc.gnu.org/ml/gcc-patches/2014-06/msg01834.html

Any of these flags works:

  • -fdisable-tree-strlen
  • -fno-builtin-malloc
  • -fno-builtin-memset (assuming you wrote 'memset' explicitly in your code)
  • -fno-builtin
  • -ffreestanding
  • -O1
  • -Os

In the code, you can hide that the pointer passed to memset is the one returned by malloc by storing it in a volatile variable, or any other trick to hide from the compiler that we are doing memset(malloc(n),0,n).
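The volatile trick can be sketched as a tiny wrapper (the name malloc_zeroed is my own): routing the malloc result through a volatile pointer object hides the data flow, so GCC can no longer prove that the memset target is the fresh allocation and must keep the memset, which in turn faults every page in.

```c
#include <stdlib.h>
#include <string.h>

/* Allocate n zeroed bytes without letting GCC rewrite the pair
   memset(malloc(n),0,n) into calloc(n,1). The volatile qualifier on p
   forces a real load, breaking the data-flow link the optimizer needs. */
void *malloc_zeroed(size_t n) {
    void *volatile p = malloc(n);   /* optimizer cannot track p past here */
    void *q = p;
    if (q) memset(q, 0, n);         /* this memset survives and faults pages in */
    return q;
}
```

With this wrapper, a micro benchmark that follows the allocation measures already-faulted memory rather than COW zero pages.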

osgx