14

On my machine Time A and Time B swap depending on whether A is defined or not (which changes the order in which the two callocs are called).

I initially attributed this to the paging system. Weirdly, when mmap is used instead of calloc the situation is even more bizarre -- both loops take the same amount of time, as expected. As can be seen with strace, the callocs ultimately result in two mmaps, so there is no return-already-allocated-memory magic going on.
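For reference, something like the following is enough to see the two mmaps (run against the binary built as in the transcript below):

strace -e trace=mmap,brk ./bench-loop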

I'm running Debian testing on an Intel i7.

#include <stdlib.h>
#include <stdio.h>
#include <sys/mman.h>

#include <time.h>

#define SIZE 500002816

#ifndef USE_MMAP
#define ALLOC calloc
#else
#define ALLOC(a, b) (mmap(NULL, a * b, PROT_READ | PROT_WRITE,  \
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0))
#endif

int main() {
  clock_t start, finish;
#ifdef A
  int *arr1 = ALLOC(sizeof(int), SIZE);
  int *arr2 = ALLOC(sizeof(int), SIZE);
#else
  int *arr2 = ALLOC(sizeof(int), SIZE);
  int *arr1 = ALLOC(sizeof(int), SIZE);
#endif
  int i;

  start = clock();
  {
    for (i = 0; i < SIZE; i++)
      arr1[i] = (i + 13) * 5;
  }
  finish = clock();

  printf("Time A: %.2f\n", ((double)(finish - start))/CLOCKS_PER_SEC);

  start = clock();
  {
    for (i = 0; i < SIZE; i++)
      arr2[i] = (i + 13) * 5;
  }
  finish = clock();

  printf("Time B: %.2f\n", ((double)(finish - start))/CLOCKS_PER_SEC);

  return 0;
}

The output I get:

 ~/directory $ cc -Wall -O3 bench-loop.c -o bench-loop
 ~/directory $ ./bench-loop 
Time A: 0.94
Time B: 0.34
 ~/directory $ cc -DA -Wall -O3 bench-loop.c -o bench-loop
 ~/directory $ ./bench-loop                               
Time A: 0.34
Time B: 0.90
 ~/directory $ cc -DUSE_MMAP -DA -Wall -O3 bench-loop.c -o bench-loop
 ~/directory $ ./bench-loop                                          
Time A: 0.89
Time B: 0.90
 ~/directory $ cc -DUSE_MMAP -Wall -O3 bench-loop.c -o bench-loop 
 ~/directory $ ./bench-loop                                      
Time A: 0.91
Time B: 0.92
Prof. Falken
sanjoyd
  • You should pose your question (clearly) outside the code block. Hiding it in code comments is not helpful. – Daniel Fischer Apr 10 '12 at 20:56
  • Might want to drop the C++ tag in favor of *nix. Can you be a bit more specific in what you are looking for? Basically it either uses a memory mapped file or regular allocation ... – AJG85 Apr 10 '12 at 20:58
  • +1: I have encountered the exact same phenomenon when doing performance measurements before. I never managed to figure out what was going on, so I really hope you get a good answer. – Leo Apr 17 '12 at 07:36
  • It's not fair that you are not taking the time used by `calloc` into account. Since you are tracking an issue that you know to be related to memory management, paging, page caching, etc., and timing it via wall clock time, you can't exclude the time used by the memory allocator. – Old Pro Apr 19 '12 at 03:58
  • Are you running in 64-bit? I wonder if the answer has to do with extended address space... – Robert Martin Apr 20 '12 at 05:31
  • @RobertMartin yes, I'm running an i7 on Debian 64 bit. – sanjoyd Apr 20 '12 at 09:19
  • @SCombinator did you look at the answer I posted? I thought your question was interesting and I reproduced the same behavior on my own system. Anyway the time difference for the loops in your code is related to different behavior between the first `calloc` call and the second one in your example. Specifically the first call does a `memset` for some reason while the second one does not and properly assumes that memory returned by mmap is automatically zeroed out. – Gabriel Southern Apr 20 '12 at 23:30

5 Answers

10

You should also test using malloc instead of calloc. One thing that calloc does is to fill the allocated memory with zeros.

I believe in your case that when you calloc arr1 last and then assign to it, it is already faulted into cache memory, since it was the last one allocated and zero-filled. When you calloc arr1 first and arr2 second, then the zero-fill of arr2 pushes arr1 out of cache.
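A minimal way to try the malloc variant with the question's code is an extra branch in the ALLOC macro (the USE_MALLOC switch here is hypothetical, not part of the original program):

#ifdef USE_MALLOC
#define ALLOC(a, b) (malloc((a) * (b)))   /* like calloc, but without the zero-fill */
#endif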

Zan Lynx
  • Using `malloc` increases the amount of time taken -- both the loops now consume ~0.9 seconds. But I get your answer, thanks! – sanjoyd Apr 10 '12 at 21:01
  • Umm, but why do the figures swap? By this logic, the first loop should run faster if it uses the array `calloc`ed second. But if it uses the array I `calloc` first, it should drive the hotter array out of the cache (the one `calloc`ed second) and both the loops should run slow, which isn't the case. – sanjoyd Apr 10 '12 at 21:13
  • @SCombinator: How weird. You didn't specify in your question what was going on. It's the opposite of how I'd figured it. Wild guess here: the zero-writes are being streamed to RAM and rewriting that data requires the CPU to do something complicated with the memory controller. – Zan Lynx Apr 10 '12 at 21:54
  • I've always assumed `calloc` to be smart enough to know that mmap returns already zeroed memory. – Per Johansson Apr 16 '12 at 13:47
  • @PerJohansson: It does but the operating system has to zero those pages so the speed hit is still there. Unless the machine has some kind of DMA engine it uses to zero-fill RAM "for free". – Zan Lynx Jan 29 '14 at 01:32
6

Guess I could have written more, or less, especially as less is more.

The reason can differ from system to system. However, for the C library (glibc):

The total time used for each array is the other way around, if you time the calloc plus the iteration.

I.e.:

Calloc arr1 : 0.494992654
Calloc arr2 : 0.000021250
Itr arr1    : 0.430646035
Itr arr2    : 0.790992411
Sum arr1    : 0.925638689
Sum arr2    : 0.791013661

Calloc arr1 : 0.503130736
Calloc arr2 : 0.000025906
Itr arr1    : 0.427719162
Itr arr2    : 0.809686047
Sum arr1    : 0.930849898
Sum arr2    : 0.809711953

The first calloc (and subsequently malloc) has a longer execution time than the second. A call such as malloc(0) before any calloc etc. evens out the time used for malloc-like calls in the same process (explanation below). One can, however, see a slight decline in time for these calls if one does several in a row.
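A minimal sketch of that warm-up, applied to the question's code (the malloc(0) call is the only addition; whether it is needed is exactly the effect described here):

  /* warm up the allocator before the timed allocations;
     the returned pointer (possibly NULL) is intentionally unused */
  void *warmup = malloc(0);
  (void)warmup;

  int *arr1 = ALLOC(sizeof(int), SIZE);
  int *arr2 = ALLOC(sizeof(int), SIZE);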

The iteration time, however, will flatten out.

So, in short: the total system time used is highest for whichever array gets alloc'ed first. This is, however, an overhead that can't be escaped within the confines of a single process.

There is a lot of maintenance going on. A quick touch on some of the cases:


A short note on pages

When a process requests memory it is served a virtual address range. This range is translated to physical memory by a page table. If addresses were translated byte by byte we would quickly get huge page tables. This is one reason why memory ranges are served in chunks - or pages. The page size is system dependent. The architecture can also provide various page sizes.
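The page size in use can be queried at run time; a small sketch using POSIX sysconf():

#include <stdio.h>
#include <unistd.h>

int main(void) {
  long page_size = sysconf(_SC_PAGESIZE);   /* typically 4096 on x86/x86-64 */
  printf("page size: %ld bytes\n", page_size);
  return 0;
}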

If we look at the execution of the above code and add some reads from /proc/PID/stat we see this in action (especially note RSS):

PID Stat {
  PID          : 4830         Process ID
  MINFLT       : 214          Minor faults, (no page memory read)
  UTIME        : 0            Time user mode
  STIME        : 0            Time kernel mode
  VSIZE        : 2039808      Virtual memory size, bytes
  RSS          : 73           Resident Set Size, Number of pages in real memory
} : Init

PID Stat {
  PID          : 4830         Process ID
  MINFLT       : 51504        Minor faults, (no page memory read)
  UTIME        : 4            Time user mode
  STIME        : 33           Time kernel mode
  VSIZE        : 212135936    Virtual memory size, bytes
  RSS          : 51420        Resident Set Size, Number of pages in real memory
} : Post calloc arr1

PID Stat {
  PID          : 4830         Process ID
  MINFLT       : 51515        Minor faults, (no page memory read)
  UTIME        : 4            Time user mode
  STIME        : 33           Time kernel mode
  VSIZE        : 422092800    Virtual memory size, bytes
  RSS          : 51428        Resident Set Size, Number of pages in real memory
} : Post calloc arr2

PID Stat {
  PID          : 4830         Process ID
  MINFLT       : 51516        Minor faults, (no page memory read)
  UTIME        : 36           Time user mode
  STIME        : 33           Time kernel mode
  VSIZE        : 422092800    Virtual memory size, bytes
  RSS          : 51431        Resident Set Size, Number of pages in real memory
} : Post iteration arr1

PID Stat {
  PID          : 4830         Process ID
  MINFLT       : 102775       Minor faults, (no page memory read)
  UTIME        : 68           Time user mode
  STIME        : 58           Time kernel mode
  VSIZE        : 422092800    Virtual memory size, bytes
  RSS          : 102646       Resident Set Size, Number of pages in real memory
} : Post iteration arr2

PID Stat {
  PID          : 4830         Process ID
  MINFLT       : 102776       Minor faults, (no page memory read)
  UTIME        : 68           Time user mode
  STIME        : 69           Time kernel mode
  VSIZE        : 2179072      Virtual memory size, bytes
  RSS          : 171          Resident Set Size, Number of pages in real memory
} : Post free()

As we can see, the actual allocation of pages in memory is postponed for arr2, awaiting page requests; this lasts until iteration begins. If we add a malloc(0) before the calloc of arr1 we can register that neither array is allocated in physical memory before iteration.


As a page might never be used, it is more efficient to do the mapping on request. This is why, when the process e.g. does a calloc, a sufficient number of pages is reserved, but not necessarily actually allocated in real memory.

When an address is referenced the page table is consulted. If the address is in a page which is not allocated the system serves a page fault and the page is subsequently allocated. The total sum of allocated pages is called the Resident Set Size (RSS).
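A rough sketch of the kind of helper used for the snapshots above (not the exact code; field positions follow proc(5), and the comm field is assumed to contain no spaces):

#include <stdio.h>

static void stat_snapshot(const char *label) {
  unsigned long minflt = 0, vsize = 0;
  long rss = 0;
  FILE *f = fopen("/proc/self/stat", "r");
  if (!f)
    return;
  /* skip fields 1-9, read minflt (10), skip fields 11-22, read vsize (23) and rss (24) */
  if (fscanf(f, "%*d %*s %*c %*d %*d %*d %*d %*d %*u %lu "
                "%*s %*s %*s %*s %*s %*s %*s %*s %*s %*s %*s %*s "
                "%lu %ld", &minflt, &vsize, &rss) == 3)
    printf("%-22s MINFLT: %lu  VSIZE: %lu  RSS: %ld\n", label, minflt, vsize, rss);
  fclose(f);
}

Calling e.g. stat_snapshot("Post calloc arr1") at the points of interest gives listings like the ones above.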

We can do an experiment with our array by iterating over (touching) e.g. 1/4 of it. Here I have also added a malloc(0) before any calloc.

Pre iteration 1/4:
RSS          : 171              Resident Set Size, Number of pages in real memory

for (i = 0; i < SIZE / 4; ++i)
    arr1[i] = 0;

Post iteration 1/4:
RSS          : 12967            Resident Set Size, Number of pages in real memory

Post iteration 1/1:
RSS          : 51134            Resident Set Size, Number of pages in real memory

To speed things up further, most systems additionally cache the N most recent page table entries in a translation lookaside buffer (TLB).


brk, mmap

When a process (c|m|…)allocs, the upper bound of the heap is expanded by brk() or sbrk(). These system calls are expensive, and to compensate for this malloc collects multiple smaller calls into one bigger brk().

This also affects free(): as a negative brk() is also resource expensive, such calls are collected and performed as one bigger operation.


For huge requests, like the one in your code, malloc() uses mmap() instead. The threshold for this, which is configurable by mallopt(), is an educated default value.
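For example, the glibc-specific knob looks like this (a sketch; the default threshold is 128 KiB, and setting it disables glibc's dynamic adjustment of the threshold):

#include <malloc.h>

/* keep allocations up to 1 MiB on the sbrk()-managed heap instead of
   giving each one its own mmap(); mallopt() returns nonzero on success */
mallopt(M_MMAP_THRESHOLD, 1024 * 1024);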

We can have fun with this by modifying SIZE in your code. If we include malloc.h and use,

struct mallinfo minf = mallinfo();

(no, not milf), we can show this (Note Arena and Hblkhd, …):

Initial:

mallinfo {
  Arena   :         0 (Bytes of memory allocated with sbrk by malloc)
  Ordblks :         1 (Number of chunks not in use)
  Hblks   :         0 (Number of chunks allocated with mmap)
  Hblkhd  :         0 (Bytes allocated with mmap)
  Uordblks:         0 (Memory occupied by chunks handed out by malloc)
  Fordblks:         0 (Memory occupied by free chunks)
  Keepcost:         0 (Size of the top-most releasable chunk)
} : Initial

MAX = ((128 * 1024) / sizeof(int)) 

mallinfo {
  Arena   :         0 (Bytes of memory allocated with sbrk by malloc)
  Ordblks :         1 (Number of chunks not in use)
  Hblks   :         1 (Number of chunks allocated with mmap)
  Hblkhd  :    135168 (Bytes allocated with mmap)
  Uordblks:         0 (Memory occupied by chunks handed out by malloc)
  Fordblks:         0 (Memory occupied by free chunks)
  Keepcost:         0 (Size of the top-most releasable chunk)
} : After malloc arr1

mallinfo {
  Arena   :         0 (Bytes of memory allocated with sbrk by malloc)
  Ordblks :         1 (Number of chunks not in use)
  Hblks   :         2 (Number of chunks allocated with mmap)
  Hblkhd  :    270336 (Bytes allocated with mmap)
  Uordblks:         0 (Memory occupied by chunks handed out by malloc)
  Fordblks:         0 (Memory occupied by free chunks)
  Keepcost:         0 (Size of the top-most releasable chunk)
} : After malloc arr2

Then we subtract sizeof(int) from MAX and get:

mallinfo {
  Arena   :    266240 (Bytes of memory allocated with sbrk by malloc)
  Ordblks :         1 (Number of chunks not in use)
  Hblks   :         0 (Number of chunks allocated with mmap)
  Hblkhd  :         0 (Bytes allocated with mmap)
  Uordblks:    131064 (Memory occupied by chunks handed out by malloc)
  Fordblks:    135176 (Memory occupied by free chunks)
  Keepcost:    135176 (Size of the top-most releasable chunk)
} : After malloc arr1

mallinfo {
  Arena   :    266240 (Bytes of memory allocated with sbrk by malloc)
  Ordblks :         1 (Number of chunks not in use)
  Hblks   :         0 (Number of chunks allocated with mmap)
  Hblkhd  :         0 (Bytes allocated with mmap)
  Uordblks:    262128 (Memory occupied by chunks handed out by malloc)
  Fordblks:      4112 (Memory occupied by free chunks)
  Keepcost:      4112 (Size of the top-most releasable chunk)
} : After malloc arr2

We register that the system works as advertised. If the size of the allocation is below the threshold, sbrk is used and the memory is handled internally by malloc; otherwise mmap is used.

This structure also helps prevent fragmentation of memory, etc.
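A sketch of how such snapshots can be printed with glibc's mallinfo() (field names as in the listings above; newer glibc also provides mallinfo2() with wider fields):

#include <malloc.h>
#include <stdio.h>

static void print_mallinfo(const char *label) {
  struct mallinfo mi = mallinfo();
  printf("mallinfo {\n"
         "  Arena   : %9d (Bytes of memory allocated with sbrk by malloc)\n"
         "  Ordblks : %9d (Number of chunks not in use)\n"
         "  Hblks   : %9d (Number of chunks allocated with mmap)\n"
         "  Hblkhd  : %9d (Bytes allocated with mmap)\n"
         "  Uordblks: %9d (Memory occupied by chunks handed out by malloc)\n"
         "  Fordblks: %9d (Memory occupied by free chunks)\n"
         "  Keepcost: %9d (Size of the top-most releasable chunk)\n"
         "} : %s\n",
         mi.arena, mi.ordblks, mi.hblks, mi.hblkhd,
         mi.uordblks, mi.fordblks, mi.keepcost, label);
}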


The point being that the malloc family is optimized for general usage. However, the mmap limits can be modified to meet special needs.

Note this (and the following 100+ lines) when/if modifying the mmap threshold.

This can be further observed if we fill (touch) every page of arr1 and arr2 before we do the timing:

Touch pages … (Here with page size of 4 kB)

for (i = 0; i < SIZE; i += 4096 / sizeof(int)) {
    arr1[i] = 0;
    arr2[i] = 0;
}

Itr arr1    : 0.312462317
CPU arr1    : 0.32

Itr arr2    : 0.312869158
CPU arr2    : 0.31


Sub notes:

So, the CPU knows the physical address then? Nah.

In the world of memory a lot has to be addressed ;). A core piece of hardware for this is the memory management unit (MMU), either an integrated part of the CPU or an external chip.

The operating system configures the MMU on boot and defines access for various regions (read only, read-write, etc.), thus giving a level of security.

The address we as mortals see is the logical address that the CPU uses. The MMU translates this to a physical address.

The CPU's address consists of two parts: a page address and an offset. [PAGE_ADDRESS.OFFSET]
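With 4 kB pages, for example, splitting a logical address into those two parts is just a shift and a mask (a sketch; the 12-bit offset is tied to the assumed 4 kB page size):

#include <stdint.h>
#include <stdio.h>

int main(void) {
  uintptr_t addr   = 0xb76f3123;     /* an arbitrary example address     */
  uintptr_t page   = addr >> 12;     /* PAGE_ADDRESS (4 kB = 2^12 bytes) */
  uintptr_t offset = addr & 0xfff;   /* OFFSET within the page           */
  printf("page: 0x%jx  offset: 0x%jx\n", (uintmax_t)page, (uintmax_t)offset);
  return 0;
}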

And for the process of getting a physical address we can have something like:

.-----.                          .--------------.
| CPU > --- Request page 2 ----> | MMU          |
+-----+                          | Pg 2 == Pg 4 |
      |                          +------v-------+
      +--Request offset 1 -+            |
                           |    (Logical page 2 EQ Physical page 4)
[ ... ]     __             |            |
[ OFFSET 0 ]  |            |            |
[ OFFSET 1 ]  |            |            |
[ OFFSET 2 ]  |            |            |     
[ OFFSET 3 ]  +--- Page 3  |            |
[ OFFSET 4 ]  |            |            |
[ OFFSET 5 ]  |            |            |
[ OFFSET 6 ]__| ___________|____________+
[ OFFSET 0 ]  |            |
[ OFFSET 1 ]  | ...........+
[ OFFSET 2 ]  |
[ OFFSET 3 ]  +--- Page 4
[ OFFSET 4 ]  |
[ OFFSET 5 ]  |
[ OFFSET 6 ]__|
[ ... ]

A CPU's logical address space is directly linked to its address width. A processor with 32-bit addresses has a logical address space of 2^32 bytes. The physical address space is limited by how much memory the system has.

There is also the handling of fragmented memory, re-alignment etc.

This brings us into the world of swap files. If a process requests more memory than is physically available, one or several pages of other process(es) are transferred to disk/swap and their pages are "stolen" by the requesting process. The MMU keeps track of this; thus the CPU doesn't have to worry about where the memory is actually located.


This further brings us to dirty memory.

If we print some information from /proc/[pid]/smaps, more specifically the range of our arrays, we get something like:

Start:
b76f3000-b76f5000
Private_Dirty:         8 kB

Post calloc arr1:
aaeb8000-b76f5000
Private_Dirty:        12 kB

Post calloc arr2:
9e67c000-b76f5000
Private_Dirty:        20 kB

Post iterate 1/4 arr1
9e67b000-b76f5000
Private_Dirty:     51280 kB

Post iterate arr1:
9e67a000-b76f5000
Private_Dirty:    205060 kB

Post iterate arr2:
9e679000-b76f5000
Private_Dirty:    410096 kB

Post free:
9e679000-9e67d000
Private_Dirty:        16 kB
b76f2000-b76f5000
Private_Dirty:        12 kB

When a virtual page is created, the system typically clears a dirty bit for the page.
When the CPU writes to a part of this page the dirty bit is set; thus when swapping, the pages with dirty bits set are written out and clean pages are skipped.


Morpfh
3

It's just a matter of when the process memory image expands by a page.

David Schwartz
  • This has nothing to do with the stack -- all of the memory in question here is coming from the heap, which comes from `mmap`. – Adam Rosenfield Apr 10 '12 at 21:57
  • Fixed. Thanks. I actually confused `calloc` with `alloca`! – David Schwartz Apr 10 '12 at 22:48
  • I don't understand -- both the allocations are an even number of pages. – sanjoyd Apr 11 '12 at 14:31
  • When you call the allocation function, the actual allocation doesn't take place yet. It takes place on first use. The allocation is like a deposit -- the bank records the money in your account but there's no specific money that's yours yet. First use is like a withdrawal, the bank has to actually go find some money and give it to you. – David Schwartz Apr 11 '12 at 19:34
  • Best answer yet, so succinct! – Prof. Falken Apr 19 '12 at 20:47
3

Short Answer

The first time that calloc is called it explicitly zeroes out the memory, while the next time it is called it assumes that the memory returned from mmap is already zeroed out.

Details

Here are some of the things that I checked to come to this conclusion, which you could try yourself if you wanted:

  1. Insert a calloc call before your first ALLOC call. You will see that after this, Time A and Time B are the same.

  2. Use the clock() function to check how long each of the ALLOC calls takes. In the case where they are both using calloc, you will see that the first call takes much longer than the second one (a minimal sketch follows after the strace output below).

  3. Use the time command to measure the execution time of the calloc version and the USE_MMAP version. When I did this I saw that the execution time for USE_MMAP was consistently slightly less.

  4. I ran with strace -tt -T, which shows both when each system call was made and how long it took. Here is part of the output:

Strace output:

21:29:06.127536 mmap(NULL, 2000015360, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fff806fd000 <0.000014>
21:29:07.778442 mmap(NULL, 2000015360, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fff093a0000 <0.000021>
21:29:07.778563 times({tms_utime=63, tms_stime=102, tms_cutime=0, tms_cstime=0}) = 4324241005 <0.000011>

You can see that the first mmap call took 0.000014 seconds, but that about 1.5 seconds elapsed before the next system call. Then the second mmap call took 0.000021 seconds, and was followed by the times call a few hundred microseconds later.
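A minimal sketch of check 2, timing the two ALLOC calls themselves with clock() (an addition to the question's main(), not part of the original code); in the calloc build the first call shows up as much longer than the second:

  clock_t t0 = clock();
  int *arr1 = ALLOC(sizeof(int), SIZE);
  clock_t t1 = clock();
  int *arr2 = ALLOC(sizeof(int), SIZE);
  clock_t t2 = clock();

  printf("alloc arr1: %.6f s\n", ((double)(t1 - t0)) / CLOCKS_PER_SEC);
  printf("alloc arr2: %.6f s\n", ((double)(t2 - t1)) / CLOCKS_PER_SEC);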

I also stepped through part of the application execution with gdb and saw that the first call to calloc resulted in numerous calls to memset, while the second call to calloc did not make any calls to memset. You can see the source code for calloc in glibc (look for __libc_calloc) if you are interested. As for why calloc does the memset on the first call but not on subsequent ones, I don't know. But I feel fairly confident that this explains the behavior you have asked about.

As for why the array that was zeroed with memset has improved performance, my guess is that it is because of values being loaded into the TLB rather than the cache, since it is a very large array. Regardless, the specific reason for the performance difference that you asked about is that the two calloc calls behave differently when they are executed.

Gabriel Southern
  • Also I originally thought that the behavior was related to the cache, but my guess now is the TLB or something else related to virtual memory and page allocation. The amount of memory allocated is so large that I don't think it will remain in the cache. Regardless I am confident that the performance difference is related to whether or not the memory is explicitly zeroed by `calloc`. – Gabriel Southern Apr 19 '12 at 05:56
  • Skimming the source, I think the reason the second calloc may not need to zero any memory is because the routine knows that memory freshly allocated to the process by the OS is already zeroed. – Brian Swift Apr 20 '12 at 18:03
  • That's true, but it's also true of the first `calloc` call in this example. So really neither of them should have to call `memset`. – Gabriel Southern Apr 20 '12 at 18:14
2

Summary: The time difference is explained by analysing the time it takes to allocate the arrays. The last calloc takes just a bit more time, whereas the others (or all of them when using mmap) take virtually no time. The real allocation in memory is probably deferred until first access.

I don't know enough about the internals of memory allocation on Linux, but I ran your program slightly modified: I've added a third array and some extra iterations per array operation. I have also taken into account the remark of Old Pro that the time to allocate the arrays was not being measured.

Conclusion: Using calloc takes longer than using mmap for the allocation (mmap uses virtually no time when you allocate the memory; the work is probably postponed until first access), and with my program there is almost no difference in the end between using mmap or calloc for the overall program execution.

Anyway, a first remark: both memory allocations happen in the memory mapping region and not in the heap. To verify this, I've added a quick n' dirty pause so you can check the memory mapping of the process (/proc/[pid]/maps); a sketch of the pause is shown just below.
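That pause is something like this (a sketch, not the exact code used; it needs <stdio.h> and <unistd.h> and goes right after the allocations in main):

  /* print the pid and block so /proc/<pid>/maps can be inspected from another shell */
  printf("pid: %d -- check /proc/%d/maps, then press Enter to continue\n",
         getpid(), getpid());
  getchar();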

Now to your question: the array last allocated with calloc seems to be really allocated in memory (not postponed). arr1 and arr2 now behave exactly the same (the first iteration is slow, subsequent iterations are faster). arr3 is faster for the first iteration because its memory was allocated earlier. When using the A macro, it is arr1 which benefits from this. My guess would be that the kernel has preallocated the array in memory for the last calloc. Why? I don't know... I've also tested it with only one array (so I removed all occurrences of arr2 and arr3); then I get (roughly) the same time for all 10 iterations of arr1.

Both malloc and mmap behave the same (results not shown below): the first iteration is slow, and subsequent iterations are faster for all 3 arrays.

Note: all results were consistent across the various gcc optimisation flags (-O0 to -O3), so it doesn't look like the behaviour is rooted in some kind of gcc optimisation.

Note2: Test run on Ubuntu Precise Pangolin (kernel 3.2), with GCC 4.6.3

#include <stdlib.h>
#include <stdio.h>
#include <sys/mman.h>

#include <time.h>

#define SIZE 500002816
#define ITERATION 10

#if defined(USE_MMAP)
#  define ALLOC(a, b) (mmap(NULL, a * b, PROT_READ | PROT_WRITE,  \
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0))
#elif defined(USE_MALLOC)
#  define ALLOC(a, b) (malloc(b * a))
#elif defined(USE_CALLOC)
#  define ALLOC calloc
#else
#  error "No alloc routine specified"
#endif

int main() {
  clock_t start, finish, gstart, gfinish;
  start = clock();
  gstart = start;
#ifdef A
  unsigned int *arr1 = ALLOC(sizeof(unsigned int), SIZE);
  unsigned int *arr2 = ALLOC(sizeof(unsigned int), SIZE);
  unsigned int *arr3 = ALLOC(sizeof(unsigned int), SIZE);
#else
  unsigned int *arr3 = ALLOC(sizeof(unsigned int), SIZE);
  unsigned int *arr2 = ALLOC(sizeof(unsigned int), SIZE);
  unsigned int *arr1 = ALLOC(sizeof(unsigned int), SIZE);
#endif
  finish = clock();
  unsigned int i, j;
  double intermed, finalres;

  intermed = ((double)(finish - start))/CLOCKS_PER_SEC;
  printf("Time to create: %.2f\n", intermed);

  printf("arr1 addr: %p\narr2 addr: %p\narr3 addr: %p\n", arr1, arr2, arr3);

  finalres = 0;
  for (j = 0; j < ITERATION; j++)
  {
    start = clock();
    {
      for (i = 0; i < SIZE; i++)
        arr1[i] = (i + 13) * 5;
    }
    finish = clock();

    intermed = ((double)(finish - start))/CLOCKS_PER_SEC;
    finalres += intermed;
    printf("Time A: %.2f\n", intermed);
  }

  printf("Time A (average): %.2f\n", finalres/ITERATION);


  finalres = 0;
  for (j = 0; j < ITERATION; j++)
  {
    start = clock();
    {
      for (i = 0; i < SIZE; i++)
        arr2[i] = (i + 13) * 5;
    }
    finish = clock();

    intermed = ((double)(finish - start))/CLOCKS_PER_SEC;
    finalres += intermed;
    printf("Time B: %.2f\n", intermed);
  }

  printf("Time B (average): %.2f\n", finalres/ITERATION);


  finalres = 0;
  for (j = 0; j < ITERATION; j++)
  {
    start = clock();
    {
      for (i = 0; i < SIZE; i++)
        arr3[i] = (i + 13) * 5;
    }
    finish = clock();

    intermed = ((double)(finish - start))/CLOCKS_PER_SEC;
    finalres += intermed;
    printf("Time C: %.2f\n", intermed);
  }

  printf("Time C (average): %.2f\n", finalres/ITERATION);

  gfinish = clock();

  intermed = ((double)(gfinish - gstart))/CLOCKS_PER_SEC;
  printf("Global Time: %.2f\n", intermed);

  return 0;
}

Results:

Using USE_CALLOC

Time to create: 0.13
arr1 addr: 0x7fabcb4a6000
arr2 addr: 0x7fabe917d000
arr3 addr: 0x7fac06e54000
Time A: 0.67
Time A: 0.48
...
Time A: 0.47
Time A (average): 0.48
Time B: 0.63
Time B: 0.47
...
Time B: 0.48
Time B (average): 0.48
Time C: 0.45
...
Time C: 0.46
Time C (average): 0.46

With USE_CALLOC and A

Time to create: 0.13
arr1 addr: 0x7fc2fa206010
arr2 addr: 0x7fc2dc52e010
arr3 addr: 0x7fc2be856010
Time A: 0.44
...
Time A: 0.43
Time A (average): 0.45
Time B: 0.65
Time B: 0.47
...
Time B: 0.46
Time B (average): 0.48
Time C: 0.65
Time C: 0.48
...
Time C: 0.45
Time C (average): 0.48

Using USE_MMAP

Time to create: 0.0
arr1 addr: 0x7fe6332b7000
arr2 addr: 0x7fe650f8e000
arr3 addr: 0x7fe66ec65000
Time A: 0.55
Time A: 0.48
...
Time A: 0.45
Time A (average): 0.49
Time B: 0.54
Time B: 0.46
...
Time B: 0.49
Time B (average): 0.50
Time C: 0.57
...
Time C: 0.40
Time C (average): 0.43
Huygens
  • I've tried to set-up Flame Graphs to see where the time is lost. But I was unsuccessful until now due mainly to lack of memory in my VM and lack of knowledge with SystemTap or Perf. See article: http://dtrace.org/blogs/brendan/2012/03/17/linux-kernel-performance-flame-graphs/ – Huygens Apr 18 '12 at 07:10