
I need help with the performance of the following code. It does a memcpy between two dynamically allocated arrays of arbitrary size:

#include <stdlib.h>   /* malloc */
#include <string.h>   /* memcpy */
#include <strings.h>  /* bzero */

/* tic()/toc(label): timing helpers, defined elsewhere */

int main()
{
  double *a, *b;
  unsigned n = 10000000, i;
  a = malloc(n*sizeof(double));
  b = malloc(n*sizeof(double));
  for(i=0; i<n; i++) {
    a[i] = 1.0;
    /* b[i] = 0.0; */
  }

  tic();
  bzero(b, n*sizeof(double));
  toc("bzero1");

  tic();
  bzero(b, n*sizeof(double));
  toc("bzero2");

  tic();
  memcpy(b, a, n*sizeof(double));
  toc("memcpy");
}

tic/toc measure the execution time.
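
For reference, a minimal sketch of what such timing helpers could look like (this is only an assumption; the actual tic/toc used for the measurements is not shown):

#include <stdio.h>
#include <time.h>

/* hypothetical tic/toc: wall-clock timing with clock_gettime
   (link with -lrt on older glibc) */
static struct timespec t_start;

void tic(void)
{
  clock_gettime(CLOCK_MONOTONIC, &t_start);
}

void toc(const char *label)
{
  struct timespec t_end;
  clock_gettime(CLOCK_MONOTONIC, &t_end);
  double seconds = (double)(t_end.tv_sec - t_start.tv_sec)
                 + (double)(t_end.tv_nsec - t_start.tv_nsec) * 1e-9;
  printf("%s %f\n", label, seconds);
}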

On my computer the memcpy takes 0.035s (Linux, gcc version 4.4.6). If I now uncomment the line that initializes the destination array b, the code is three times faster (!) - 0.011s.

I have observed similar behavior when using a copy loop instead of memcpy. Usually I do not care about this, since it is enough to 'initialize' the memory before using it. Now, however, I need to perform a simple memory copy and do it as fast as possible. Initializing the data first means writing e.g. 0 to the memory, which is not necessary and takes time; I would like the copy to run at the full available memory bandwidth.

Is there a solution to this problem? Or is it connected to the way Linux handles dynamic memory (some sort of lazy page allocation?) and cannot be worked around? How does this behave on other systems?

Edit: The same results are obtained with gcc 4.6. I used -O3 to compile.

Edit: Thank you all for your comments. I do understand that memory mapping takes time. I guess I just have a hard time accepting that it takes so long, much longer than the actual memory access. The code has been modified to include a benchmark of the initialization of array b using two subsequent bzero calls. The timings now show

bzero1 0.273981
bzero2 0.056803
memcpy 0.117934

Clearly, the first bzero call does much more than just stream zeros to memory: it also pays for the memory mapping and the zeroing of the pages. The second bzero call, on the other hand, takes half of the time required to do a memcpy, which is exactly as expected - write-only time vs. read-and-write time. I understand that the zeroing part of the first bzero call's overhead must be there for OS security reasons. But what about the rest? Can I not decrease it somehow, e.g. by using larger memory pages or different kernel settings?
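
As a sanity check on these numbers: n*sizeof(double) is 80 MB, so the second bzero writes 80 MB in about 0.057 s (roughly 1.4 GB/s), while the memcpy moves 160 MB (80 MB read plus 80 MB written) in about 0.118 s (roughly 1.36 GB/s), i.e. both run at essentially the same memory bandwidth. The extra ~0.22 s of the first bzero is then the mapping and kernel zeroing overhead.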

I should mention that I run this on Ubuntu wheezy.

angainor
  • Did you compile with a recent GCC compiler (e.g. `gcc-4.7`) and did you enable optimizations (e.g. `-O3`)? – Basile Starynkevitch Sep 03 '12 at 16:18
  • I wonder if it has to do with the cache; initializing b causes it to be loaded into cache before your timer is active. – mah Sep 03 '12 at 16:21
  • @mah It is not cache because b is 80MB. – usr Sep 03 '12 at 16:33
  • Somewhat Related: [Why does malloc initialize the values to 0 in gcc?](http://stackoverflow.com/questions/8029584/why-does-malloc-initialize-the-values-to-0-in-gcc) - Your allocation is large enough where it will invoke lazy allocation and incur zeroing overhead. – Mysticial Sep 03 '12 at 16:35
  • What are the definitions of `tic` and `toc`? Unless you're very careful, the compiler may have the freedom to reorder them... – R.. GitHub STOP HELPING ICE Sep 03 '12 at 17:27
  • I would suggest asking your follow up question about reducing the time taken to map memory as a new question, and linking to this question. This is because it doesn't really match the tags c, memcpy or malloc anymore, and also this question now has a bunch of answers and comments that are not relevant to your new question. – verdesmarald Sep 03 '12 at 17:42

3 Answers


It's probably lazy page allocation: Linux only maps the pages on first access. IIRC each page in a newly allocated block on Linux is a copy-on-write mapping of a shared zero page, and your allocations are big enough to demand new blocks.

If you want to work around it, you could write one byte or word at 4k intervals, i.e. once per page. That might get the virtual addresses mapped to RAM slightly faster than writing the whole of each page.
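
A sketch of what that pre-faulting pass could look like, reusing b, n and i from the question and assuming 4 kB pages:

/* touch one double per (assumed) 4 kB page so the kernel maps the pages
   before the timed copy; sysconf(_SC_PAGESIZE) gives the real page size */
unsigned stride = 4096 / sizeof(double);   /* 512 doubles per 4 kB page */
for (i = 0; i < n; i += stride)
    b[i] = 0.0;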

I wouldn't expect (most efficient workaround to force the lazy memory mapping to happen) plus (copy) to be significantly faster than just (copy) without the initialization of b, though. So unless there's a specific reason why you care about the performance of just the copy, rather than of the whole operation, it's fairly futile. It's "pay now or pay later": Linux pays later, and you're only measuring the time for later.

Steve Jessop
  • With my GNU/Linux, the first runs faster than the second. – md5 Sep 03 '12 at 16:32
  • @Kirilenko: fair enough. Out of (A) init every byte (B) init one byte per page, and (C) init nothing: what's "the first", "the second", and the times involved? – Steve Jessop Sep 03 '12 at 16:34
  • The code of the OP (commented line), takes about 0.17s, and when I uncomment the line, 0.21s. – md5 Sep 03 '12 at 16:49
  • @Steve Jessop I guess I was afraid it would be so. Initialization of only one value in every page does not help - the initialization part is two times slower than the memcpy part, which by the way has to both read and write the data. So even if lazy allocation would require zeroing of the array, it would only explain a quarter of the overhead, i.e. 0.05s (half of the time required to memcpy on initialized arrays). – angainor Sep 03 '12 at 16:51
  • This has nothing to do with overcommit. Commit accounting and instantiation of physical pages to back virtual memory are two completely separate areas of memory management. They could in principle be combined, but doing so would give much worse cache utilization and overall worse performance. – R.. GitHub STOP HELPING ICE Sep 03 '12 at 17:05
  • @R..: OK, so there's no overcommit strategy that says, "back and map the physical pages before `malloc` returns, notwithstanding that for typical code this is slow and wasteful"? In that case I'll remove the remark about overcommit. – Steve Jessop Sep 03 '12 at 17:14
  • @Kirilenko: ah, so you're getting the opposite result to the questioner -- for you, initializing the memory *slows down* the `memcpy`. I don't have a ready explanation for that. – Steve Jessop Sep 03 '12 at 17:17
  • @Steve: One possible implementation of commit charge accounting would be to make a system without COW. This is what you're describing. On such a system, `fork` would be atrociously slow, and memory would be full of duplicate pages and have no room for caching files. This is why nobody does it. As soon as you want both COW and commit charge accounting, you have to do the commit charge accounting separate from page allocation. – R.. GitHub STOP HELPING ICE Sep 03 '12 at 17:24
  • @R..: seems a bit drastic to me: in principle the thing I'm describing could be done by having COW, but also having an implementation of `malloc` that writes the memory (thus forcing it to be mapped). `fork` would be unaffected, but I guess the structure of GNU/Linux is such that (a) commit strategy can't affect `malloc` because `malloc` is in glibc, so won't collude in my strategy, and (b) allocation of new memory for return from `malloc` can't be separated from copying process space for `fork`, so there's no way to implement the strategy I described sensibly. Is that correct? – Steve Jessop Sep 03 '12 at 17:27
  • @Steve: The `malloc` strategy you describe is perfectly valid and possible to implement; the only thing I see as a mistake in your reasoning is thinking that it's related to overcommit. Whether you want overcommit and whether you want `malloc` to return COW zero pages (that are guaranteed to be instantiable if overcommit is disabled) are two separate issues. In fact `mmap` has a flag to request that the pages get pre-allocated physical memory backing if possible. I find it doubtful that this strategy would be preferable as a default however. It does not improve big-O and can make it worse. – R.. GitHub STOP HELPING ICE Sep 03 '12 at 17:39
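
Presumably the mmap flag referred to in the last comment is MAP_POPULATE, which asks the kernel to pre-fault the mapping; a sketch of allocating b that way (the specific flag is my assumption, it is not named in the comment):

#include <sys/mman.h>

/* anonymous mapping; MAP_POPULATE asks the kernel to pre-fault the pages
   at mmap time instead of lazily on first access */
double *b = mmap(NULL, n*sizeof(double),
                 PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE,
                 -1, 0);
if (b == MAP_FAILED) {
    /* handle the error */
}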

Surely if you are comparing the speed of initialise-and-copy to the speed of just copy, then the initialisation should be included in the timed section. It appears to me you should actually be comparing this:

// Version 1
for(i=0; i<n; i++)
    a[i] = 1.0;

tic();

memcpy(b, a, n*sizeof(double));

toc();

To this:

// Version 2
for(i=0; i<n; i++)
    a[i] = 1.0;

tic();

for(i=0; i<n; i++)
    b[i] = 0.0;
memcpy(b, a, n*sizeof(double));

toc();

I expect this will see your 3x speed improvement drop sharply.

EDIT: As suggested by Steve Jessop, you may also want to test a third strategy of only touching one entry per page:

// Version 3
// DOUBLES_PER_PAGE: how many doubles fit in one page, e.g. 4096/8 = 512 for 4 kB pages
#define DOUBLES_PER_PAGE (4096 / sizeof(double))

for(i=0; i<n; i++)
    a[i] = 1.0;

tic();

for(i=0; i<n; i+=DOUBLES_PER_PAGE)
    b[i] = 0.0;
memcpy(b, a, n*sizeof(double));

toc();
verdesmarald
  • The whole point is that I do not want to pay for zeroing of the memory when I later copy other data to it. Anyway, assigning 0s to the array entries itself should only require half of the time it takes to memcpy on initialized arrays (read vs. read+write). The remaining overhead is hence caused by the OS and not a hardware limitation. I see no reason to include it in my time measurements, but I would gladly get rid of it by e.g. some system settings. – angainor Sep 03 '12 at 16:55
  • @angainor You are going to have to pay the time required to map the pages either way, but you are including it in one measurement but not the other. Saying "look it's three times faster when I don't include the time taken to map 80mb worth of pages" is nice and all, but you are comparing apples to oranges. Saying "the code is three times faster (!)" is just plain fallacious. – verdesmarald Sep 03 '12 at 17:15

The first bzero runs longer because of (1) lazy page allocation and (2) lazy zero-initialization of the pages by the kernel. While the second cost is unavoidable for security reasons, lazy page allocation may be optimized by using larger ("huge") pages.

There are at least two ways to use huge pages on Linux. The hard way is hugetlbfs. The easy way is transparent huge pages.

Search for khugepaged in the list of processes on your system. If such a process exists, transparent huge pages are supported and you can use them in your application if you change the malloc to this:

/* needs <stdlib.h> for posix_memalign and <sys/mman.h> for madvise/MADV_HUGEPAGE */
posix_memalign((void **)&b, 2*1024*1024, n*sizeof(double)); /* align to the 2 MB huge page size */
madvise((void *)b, n*sizeof(double), MADV_HUGEPAGE);        /* hint the kernel to back this range with THP */
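
For comparison, a rough sketch of the "hard way" with explicit huge pages (MAP_HUGETLB), assuming huge pages have been reserved on the system beforehand:

#include <stdlib.h>
#include <sys/mman.h>

/* explicit huge pages: needs huge pages reserved in advance
   (e.g. via /proc/sys/vm/nr_hugepages); the kernel rounds the length
   up to a multiple of the huge page size */
double *b = mmap(NULL, n*sizeof(double),
                 PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB,
                 -1, 0);
if (b == MAP_FAILED) {
    /* no huge pages available: fall back to a normal allocation */
    b = malloc(n*sizeof(double));
}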
Evgeny Kluev
  • 24,287
  • 7
  • 55
  • 98
  • That indeed removed the 'memory mapping' overhead and left only the initial zeroing of the data. I wonder if this will work with OpenMP on NUMA architectures, with the effect of binding the memory to a specific NUMA node depending on which CPU the calling thread runs on (first-touch memory allocation). But I guess there is no reason why it should not. Thanks! – angainor Sep 04 '12 at 10:53
  • I think there should be no problem if you bind the memory to a specific NUMA node in aligned pieces whose size is a multiple of 2 megabytes, or if these pieces are much larger than 2 megabytes. For smaller pieces, it is hard to say what is better - huge pages without TLB problems and with faster allocation, or normal pages properly distributed between NUMA nodes. – Evgeny Kluev Sep 04 '12 at 11:23