6

I need to clear large address ranges (order of 200 pages at a time) in Linux. There are two approaches I tried -

  1. Use memset - the simplest way to clear the address range. It performed a bit slower than method 2.

  2. Use munmap/mmap - I called munmap on the address range, then mmap'd the same address again with the same permissions. Since MAP_ANONYMOUS is passed, the pages are cleared. (A minimal sketch of this approach follows below.)
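
For reference, here is a minimal sketch of the second approach (error handling omitted; addr and len stand in for my real range):

#include <sys/mman.h>

/* Sketch of approach 2: drop the old pages, then map the same range again.
   Another thread's mmap() can grab the range in the window between the two
   calls, which is exactly the race I am worried about. */
void clear_range(void *addr, size_t len)
{
    munmap(addr, len);
    mmap(addr, len, PROT_READ | PROT_WRITE,
         MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);   /* addr is only a hint here */
}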

The second method makes a benchmark run 5-10% faster. The benchmark of course does a lot more than just clearing the pages. If I understand correctly, this is because the operating system has a pool of zero'd pages which it maps to the address range.

But I don't like this approach because the munmap and mmap are not atomic: another mmap (with NULL as the first argument) done simultaneously could land in my address range and render it unusable.

So my question is: does Linux provide a system call that can swap out the physical pages of an address range for zero pages?

I looked at the glibc source (specifically memset) to see whether it uses any technique to do this efficiently, but I couldn't find anything.

Ajay Brahmakshatriya
  • 8,993
  • 3
  • 26
  • 49
  • I understand acquiring a lock on the memory and using `munmap`/`mmap` is a potential solution. But I want to avoid doing that if possible. – Ajay Brahmakshatriya Apr 18 '18 at 09:54
  • 1
    Perhaps: https://stackoverflow.com/questions/24171602/mmap-resetting-old-memory-to-a-zerod-non-resident-state – Antti Haapala -- Слава Україні Apr 18 '18 at 09:58
  • also https://stackoverflow.com/questions/18595123/zero-a-large-memory-mapping-with-madvise – Antti Haapala -- Слава Україні Apr 18 '18 at 10:01
  • 3
    Note:the `mmap()` doesn't actually *clear* the memory. It only clones the pages from `/dev/zero`, with COW set. The cost will come later, once (if!) the pages are referenced. – wildplasser Apr 18 '18 at 10:02
  • @AnttiHaapala I am precisely trying to solve the same problem (of writing a custom `mmap/munmap`). I use similar bookkeeping bits and use `MAP_NORESERVE` with `PROT_NONE` for the initial `mmap`, followed by calls to `mprotect` as I simulate the `mmap/munmap` (a rough sketch of this setup appears after these comments). I guess `madvise` on `munmap` should solve the problem. Thanks, I will try it out. – Ajay Brahmakshatriya Apr 18 '18 at 10:03
  • @wildplasser I agree, but the cost might be removed (even for later) if the os has some zero'd out physical pages. Which is what I think I am observing. – Ajay Brahmakshatriya Apr 18 '18 at 10:04
  • Are you talking about page-deduplication? IIRC this is only used in VM-supervisors. (and it can introduce side-channels) – wildplasser Apr 18 '18 at 10:07
  • @wildplasser I am not sure what page-deduplication is, what I meant is that when the write eventually happens, the OS doesn't really have to write zero on the pages because already zero'd out physical pages might be available. – Ajay Brahmakshatriya Apr 18 '18 at 10:08
  • I don't think the OS can find free physical pages based on their *content* . – wildplasser Apr 18 '18 at 10:13
  • 1
    @wildplasser that's not what Ajay suggested - instead, that if it is unmapped, chances are that by the time it needs to be mapped again, the idle process has zeroed some page frames already... – Antti Haapala -- Слава Україні Apr 18 '18 at 10:16
  • @wildplasser yes, like @AnttiHaapala said, the idle process clearing some pages is what I meant. And also in Linux, I think there is a separate linked list of zeroed free physical pages maintained (again, added to by the idle process). Thus the OS doesn't have to *find* a physical page with all zeros but just pop one from the list when needed. I think this is what is used to map the `.bss` section on load too. – Ajay Brahmakshatriya Apr 18 '18 at 10:20
  • *Use munmap/mmap - I called munmap on the address range then mmap'd the same address again with the same permissions. Since MAP_ANONYMOUS is passed, the pages are cleared.* Those new page mappings might not actually be mapped to real physical pages yet. You need to add actually accessing at least one byte per page of the mapping to ensure you're including the time needed to actually recreate physical page mappings if necessary. – Andrew Henle Apr 18 '18 at 11:15
  • 1
    @AndrewHenle That's what I said an hour ago, but the OP doesn't really seem to understand what COW on a cloned /dev/zero does @ OP: see for instance https://stackoverflow.com/a/8507066/905902 – wildplasser Apr 18 '18 at 11:42
  • @wildplasser I agree - OP seems to think all the work is done when `mmap()` returns. For all readers: it's not done yet. Just because you now have a logical mapping to a page in your process's address space doesn't mean that there's an actual physical page of memory behind that logical mapping. Those logical-to-physical mappings only get created when they need to be created, and it's one *very* expensive process. OP almost certainly failed to benchmark that. – Andrew Henle Apr 18 '18 at 11:48
  • @AndrewHenle I am not assuming that the physical mappings are created immediately. I think I understand what COW on /dev/zero does. What I just meant is that the cost of actually zeroing the pages (whenever it is done - when it is actually written to) would be saved by idle-zeroing. Art in his answer seems to suggest that it is not very significant. – Ajay Brahmakshatriya Apr 18 '18 at 11:57
  • @AndrewHenle can you see the comment I posted on wildplasser's answer below. I think that might clear the disagreement we have. – Ajay Brahmakshatriya Apr 18 '18 at 16:37
  • related: [Why malloc+memset is slower than calloc?](https://stackoverflow.com/q/2688466/995714) – phuclv Apr 18 '18 at 16:46
  • 1
    @AjayBrahmakshatriya The code you posted doesn't measure just the time difference between `memset()`, `madvise()`, and `mmap()` to clear existing pages. It counts the time for all the initial `mmap()` and final `munmaps()` plus extraneous loops to write `'A'` or `'B'` to the entire mapped page. None of that is applicable to the performance differences between `memset()`,`madvise()`, and `mmap()` to zero out memory. FWIW, a quick test I ran showed that a simple, single-threaded call to `memset()` is somewhere between five and ten times faster than redoing `mmap()`. – Andrew Henle Apr 18 '18 at 17:04
  • @AndrewHenle I think counting the time for the loop for writing 'B' is essential to the clearing out of memory because of COW like you yourself mentioned. – Ajay Brahmakshatriya Apr 18 '18 at 17:06
  • @AndrewHenle coming to your test, how many pages did you use? Because mmap has a fixed system call cost which is pretty high. Maybe mmap might become better for bigger allocation sizes. – Ajay Brahmakshatriya Apr 18 '18 at 17:09
  • rewriting the silly loops to `memset (buffer, 'A', length);` (same for B) reduces the total running time from ~50 to ~30 seconds. – wildplasser Apr 18 '18 at 18:48
  • @wildplasser yes that is true. But the purpose is to compare memset approach to mmap approach. The rest of the overheads should be the same for both the cases I believe. – Ajay Brahmakshatriya Apr 18 '18 at 18:50
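
For context, here is a rough sketch of the reservation scheme mentioned in the comments above (MAP_NORESERVE/PROT_NONE up front, then mprotect per allocation); the names and sizes are made up for illustration, and error handling is omitted:

#include <sys/mman.h>

#define POOL_SIZE (1024UL * 1024 * 1024)   /* hypothetical 1 GiB reservation */

/* Reserve address space up front without committing backing pages. */
static char *pool_init(void)
{
    return mmap(NULL, POOL_SIZE, PROT_NONE,
                MAP_ANONYMOUS | MAP_PRIVATE | MAP_NORESERVE, -1, 0);
}

/* "Allocate" by enabling access to a slice of the reservation. */
static void *pool_alloc(char *pool, size_t offset, size_t len)
{
    mprotect(pool + offset, len, PROT_READ | PROT_WRITE);
    return pool + offset;
}

/* "Free" by revoking access; madvise(MADV_DONTNEED), as suggested above,
   also drops the contents so the next pool_alloc() sees zero-filled pages. */
static void pool_free(char *pool, size_t offset, size_t len)
{
    madvise(pool + offset, len, MADV_DONTNEED);
    mprotect(pool + offset, len, PROT_NONE);
}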

3 Answers

7

memset() appears to be about an order of magnitude faster than mmap() to get a new zero-filled page, at least on the Solaris 11 server I have access to right now. I strongly suspect that Linux will produce similar results.

I wrote a small benchmark program:

#include <stdio.h>
#include <sys/mman.h>
#include <string.h>
#include <strings.h>

#include <sys/time.h>

#define NUM_BLOCKS ( 512 * 1024 )
#define BLOCKSIZE ( 4 * 1024 )

int main( int argc, char **argv )
{
    int ii;

    char *blocks[ NUM_BLOCKS ];

    hrtime_t start = gethrtime();

    for ( ii = 0; ii < NUM_BLOCKS; ii++ )
    {
        blocks[ ii ] = mmap( NULL, BLOCKSIZE,
            PROT_READ | PROT_WRITE,
            MAP_ANONYMOUS | MAP_PRIVATE, -1, 0 );
        // force the creation of the mapping
        blocks[ ii ][ ii % BLOCKSIZE ] = ii;
    }

    printf( "setup time:    %lf sec\n",
        ( gethrtime() - start ) / 1000000000.0 );

    for ( int jj = 0; jj < 4; jj++ )
    {
        start = gethrtime();

        for ( ii = 0; ii < NUM_BLOCKS; ii++ )
        {
            blocks[ ii ] = mmap( blocks[ ii ],
                BLOCKSIZE, PROT_READ | PROT_WRITE,
                MAP_FIXED | MAP_ANONYMOUS | MAP_PRIVATE, -1, 0 );
            blocks[ ii ][ ii % BLOCKSIZE ] = 0;
        }

        printf( "mmap() time:   %lf sec\n",
            ( gethrtime() - start ) / 1000000000.0 );
        start = gethrtime();

        for ( ii = 0; ii < NUM_BLOCKS; ii++ )
        {
            memset( blocks[ ii ], 0, BLOCKSIZE );
        }

        printf( "memset() time: %lf sec\n",
            ( gethrtime() - start ) / 1000000000.0 );
    }

    return( 0 );
}

Note that writing a single byte anywhere in the page is all that's needed to force the creation of the physical page.

I ran it on my Solaris 11 file server (the only POSIX-style system I have running on bare metal right now). I didn't test madvise() on my Solaris system because Solaris, unlike Linux, doesn't guarantee that the mapping will be repopulated with zero-filled pages, only that "the system starts to free the resources".

The results:

setup time:    11.144852 sec
mmap() time:   15.159650 sec
memset() time: 1.817739 sec
mmap() time:   15.029283 sec
memset() time: 1.788925 sec
mmap() time:   15.083473 sec
memset() time: 1.780283 sec
mmap() time:   15.201085 sec
memset() time: 1.771827 sec

memset() is almost an order of magnitude faster. When I get a chance, I'll rerun that benchmark on Linux, but it'll likely have to be on a VM (AWS etc.)

That's not surprising - mmap() is expensive, and the kernel still needs to zero the pages at some time.

Interestingly, commenting out one line

        for ( ii = 0; ii < NUM_BLOCKS; ii++ )
        {
            blocks[ ii ] = mmap( blocks[ ii ],
                BLOCKSIZE, PROT_READ | PROT_WRITE,
                MAP_FIXED | MAP_ANONYMOUS | MAP_PRIVATE, -1, 0 );
            //blocks[ ii ][ ii % BLOCKSIZE ] = 0;
        }

produces these results:

setup time:    10.962788 sec
mmap() time:   7.524939 sec
memset() time: 10.418480 sec
mmap() time:   7.512086 sec
memset() time: 10.406675 sec
mmap() time:   7.457512 sec
memset() time: 10.296231 sec
mmap() time:   7.420942 sec
memset() time: 10.414861 sec

The burden of forcing the creation of the physical mapping has shifted to the memset() call, leaving only the implicit munmap() in the test loops, where the mappings are destroyed when the MAP_FIXED mmap() call replaces them. Note that just the munmap() takes about 3-4 times longer than keeping the pages in the address space and memset()'ing them to zero.

The cost of mmap() isn't really in the mmap()/munmap() system calls themselves; it's that each new page requires a lot of behind-the-scenes CPU cycles to create the actual physical mapping, and that work doesn't happen in the mmap() call - it happens afterwards, when the process first accesses the memory page.
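
One way to see that split is to time the mmap() call and the first-touch writes separately - a rough sketch for Linux, using clock_gettime() instead of gethrtime(), with arbitrary constants and no error handling:

#include <stdio.h>
#include <time.h>
#include <sys/mman.h>

#define NPAGES 200
#define PAGESIZE 4096

static double secs( struct timespec a, struct timespec b )
{
    return ( b.tv_sec - a.tv_sec ) + ( b.tv_nsec - a.tv_nsec ) / 1000000000.0;
}

int main( void )
{
    size_t len = ( size_t ) NPAGES * PAGESIZE;
    struct timespec t0, t1, t2;

    clock_gettime( CLOCK_MONOTONIC, &t0 );

    char *p = mmap( NULL, len, PROT_READ | PROT_WRITE,
        MAP_ANONYMOUS | MAP_PRIVATE, -1, 0 );

    clock_gettime( CLOCK_MONOTONIC, &t1 );

    // first touch: one write per page takes the faults that
    // actually build the physical mappings
    for ( size_t ii = 0; ii < len; ii += PAGESIZE )
        p[ ii ] = 1;

    clock_gettime( CLOCK_MONOTONIC, &t2 );

    printf( "mmap() call: %lf sec\n", secs( t0, t1 ) );
    printf( "first touch: %lf sec\n", secs( t1, t2 ) );

    return( 0 );
}

With only 200 pages both numbers are tiny, so in practice you'd repeat this in a loop to get stable readings.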

If you doubt the results, note this LKML post from Linus Torvalds himself:

...

HOWEVER, playing games with the virtual memory mapping is very expensive in itself. It has a number of quite real disadvantages that people tend to ignore because memory copying is seen as something very slow, and sometimes optimizing that copy away is seen as an obvious improvement.

Downsides to mmap:

  • quite noticeable setup and teardown costs. And I mean noticeable. It's things like following the page tables to unmap everything cleanly. It's the book-keeping for maintaining a list of all the mappings. It's the TLB flush needed after unmapping stuff.
  • ...

Profiling the code using Solaris Studio's collect and analyzer tools produced the following output:

Source File: mm.c

Inclusive        Inclusive        Inclusive         
Total CPU Time   Sync Wait Time   Sync Wait Count   Name
sec.             sec.                               
                                                      1. #include <stdio.h>
                                                      2. #include <sys/mman.h>
                                                      3. #include <string.h>
                                                      4. #include <strings.h>
                                                      5. 
                                                      6. #include <sys/time.h>
                                                      7. 
                                                      8. #define NUM_BLOCKS ( 512 * 1024 )
                                                      9. #define BLOCKSIZE ( 4 * 1024 )
                                                     10. 
                                                     11. int main( int argc, char **argv )
                                                         <Function: main>
 0.011           0.               0                  12. {
                                                     13.     int ii;
                                                     14. 
                                                     15.     char *blocks[ NUM_BLOCKS ];
                                                     16. 
 0.              0.               0                  17.     hrtime_t start = gethrtime();
                                                     18. 
 0.129           0.               0                  19.     for ( ii = 0; ii < NUM_BLOCKS; ii++ )
                                                     20.     {
                                                     21.         blocks[ ii ] = mmap( NULL, BLOCKSIZE,
                                                     22.             PROT_READ | PROT_WRITE,
 3.874           0.               0                  23.             MAP_ANONYMOUS | MAP_PRIVATE, -1, 0 );
                                                     24.         // force the creation of the mapping
 7.928           0.               0                  25.         blocks[ ii ][ ii % BLOCKSIZE ] = ii;
                                                     26.     }
                                                     27. 
                                                     28.     printf( "setup time:    %lf sec\n",
 0.              0.               0                  29.         ( gethrtime() - start ) / 1000000000.0 );
                                                     30. 
 0.              0.               0                  31.     for ( int jj = 0; jj < 4; jj++ )
                                                     32.     {
 0.              0.               0                  33.         start = gethrtime();
                                                     34. 
 0.560           0.               0                  35.         for ( ii = 0; ii < NUM_BLOCKS; ii++ )
                                                     36.         {
                                                     37.             blocks[ ii ] = mmap( blocks[ ii ],
                                                     38.                 BLOCKSIZE, PROT_READ | PROT_WRITE,
33.432           0.               0                  39.                 MAP_FIXED | MAP_ANONYMOUS | MAP_PRIVATE, -1, 0 );
29.535           0.               0                  40.             blocks[ ii ][ ii % BLOCKSIZE ] = 0;
                                                     41.         }
                                                     42. 
                                                     43.         printf( "mmap() time:   %lf sec\n",
 0.              0.               0                  44.             ( gethrtime() - start ) / 1000000000.0 );
 0.              0.               0                  45.         start = gethrtime();
                                                     46. 
 0.101           0.               0                  47.         for ( ii = 0; ii < NUM_BLOCKS; ii++ )
                                                     48.         {
 7.362           0.               0                  49.             memset( blocks[ ii ], 0, BLOCKSIZE );
                                                     50.         }
                                                     51. 
                                                     52.         printf( "memset() time: %lf sec\n",
 0.              0.               0                  53.             ( gethrtime() - start ) / 1000000000.0 );
                                                     54.     }
                                                     55. 
 0.              0.               0                  56.     return( 0 );
 0.              0.               0                  57. }

                                                    Compile flags:  /opt/SUNWspro/bin/cc -g -m64  mm.c -W0,-xp.XAAjaAFbs71a00k.

Note the large amount of time spent in mmap(), and also in the setting of a single byte in each newly-mapped page.

This is an overview from the analyzer tool. Note the large amount of system time:

[Profile overview screenshot]

The large amount of system time consumed is the time taken to map and unmap the physical pages.

This timeline shows when all that time was consumed:

[Analyzer timeline screenshot]

The light green is system time - that's all in the mmap() loops. You can see it switch over to dark-green user time when the memset() loops run. I've highlighted one of those instances so you can see what's going on at that time.

Updated results from a Linux VM:

setup time:    2.567396 sec
mmap() time:   2.971756 sec
memset() time: 0.654947 sec
mmap() time:   3.149629 sec
memset() time: 0.658858 sec
mmap() time:   2.800389 sec
memset() time: 0.647367 sec
mmap() time:   2.915774 sec
memset() time: 0.646539 sec

This tracks exactly with what I stated in my comment yesterday: "FWIW, a quick test I ran showed that a simple, single-threaded call to memset() is somewhere between five and ten times faster than redoing mmap()."

I simply do not understand this fascination with mmap(). mmap() is one hugely expensive call, and it's a forced single-threaded operation - there's only one set of physical memory on the machine. mmap() is not only S-L-O-W, it impacts both the entire process address space and the VM system on the entire host.

Using any form of mmap() just to zero out memory pages is counterproductive. First, the pages don't get zeroed for free - something has to memset() them to clear them. It just doesn't make any sense to add tearing down and recreating a memory mapping to that memset() just to clear a page of RAM.

memset() also has the advantage that more than one thread can be clearing memory at any one time. Making changes to memory mappings is a single-threaded process.
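
For illustration, here's a rough sketch of parallel clearing with pthreads (thread count and sizes are arbitrary; error handling omitted):

#include <pthread.h>
#include <string.h>
#include <sys/mman.h>

#define NTHREADS 4

struct chunk
{
    char *base;
    size_t len;
};

static void *zero_chunk( void *arg )
{
    struct chunk *c = arg;
    // each thread clears its own slice of the region
    memset( c->base, 0, c->len );
    return( NULL );
}

static void parallel_zero( char *base, size_t len )
{
    pthread_t tids[ NTHREADS ];
    struct chunk chunks[ NTHREADS ];
    size_t per = len / NTHREADS;

    for ( int ii = 0; ii < NTHREADS; ii++ )
    {
        chunks[ ii ].base = base + ( size_t ) ii * per;
        chunks[ ii ].len = ( ii == NTHREADS - 1 ) ? len - ( size_t ) ii * per : per;
        pthread_create( &tids[ ii ], NULL, zero_chunk, &chunks[ ii ] );
    }

    for ( int ii = 0; ii < NTHREADS; ii++ )
    {
        pthread_join( tids[ ii ], NULL );
    }
}

int main( void )
{
    size_t len = ( size_t ) 200 * 4096;   // ~200 pages, as in the question
    char *p = mmap( NULL, len, PROT_READ | PROT_WRITE,
        MAP_ANONYMOUS | MAP_PRIVATE, -1, 0 );

    parallel_zero( p, len );

    return( 0 );
}

Compile with -pthread. For a range this small the thread startup cost dominates; parallel clearing only pays off for much larger ranges.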

Andrew Henle
  • 32,625
  • 3
  • 24
  • 56
  • First of all, thanks a lot for the detailed analysis and for specifically profiling each step in the program. I think the time for the actual `mmap` call can be ignored because in my case, like I said, I have around 200 contiguous pages, so there will be a single `mmap` call. But yes, there is a significant cost to set a 0 in the page. This I think is because of the page faults (which won't happen with memset). – Ajay Brahmakshatriya Apr 19 '18 at 04:33
  • I should have probably mentioned in the question that I have to do `mprotect` before the `memset` (because I am implementing an allocator). I think in that case the cost of just a single `mmap` would be less than `mprotect` + `memset`. Notice that the cost of `memset` would be high if it comes after an `mprotect` (because it might also take a page fault due to lazy allocation). – Ajay Brahmakshatriya Apr 19 '18 at 04:35
  • Anyway, this clears most of my doubts, in general case `memset` is faster than `mmap`. But for my specific case since I also have to change permissions (from `PROT_NONE` to `PROT_READ | PROT_WRITE`), I think a single call to `mmap` should work better. – Ajay Brahmakshatriya Apr 19 '18 at 04:36
  • 2
    @AjayBrahmakshatriya `the time for the actual mmap call can be ignored ...` You are wrong. You are torturing the memory subsystem by allocating/deallocating 200 pages on every loop. This will cause a lot of work *for the kernel* to update the page tables, the free lists and the lru-cache. Plus: the kernel will need to acquire locks on these resource lists, which could affect other processes, too. – joop Apr 19 '18 at 09:08
  • 1
    *Anyway, this clears most of my doubts, in general case memset is faster than mmap. But for my specific case since I also have to change permissions (from `PROT_NONE` to `PROT_READ | PROT_WRITE`), I think a single call to `mmap` should work better.* **Why** would you even begin to think flipping a few bits in a page table entry is slower than completely tearing down and redoing an entire page mapping? Did you even bother benchmarking and testing that assumption? – Andrew Henle Apr 19 '18 at 09:57
2

madvise(..., MADV_DONTNEED) should be equivalent to munmap/mmap on anonymous mappings on Linux. It's a bit weird because that's not how I'd expect the semantics of "don't need" to work, but it does throw away the page(s) on Linux.

$ cat > foo.c
#include <sys/types.h>
#include <sys/mman.h>
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
    int *foo = mmap(NULL, getpagesize(), PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
    *foo = 42;
    printf("%d\n", *foo);
    madvise(foo, getpagesize(), MADV_DONTNEED);
    printf("%d\n", *foo);
    return 0;
}
$ cc -o foo foo.c && ./foo
42
0
$ uname -sr
Linux 3.10.0-693.11.6.el7.x86_64

MADV_DONTNEED does not do that on other operating systems so this is definitely not portable. For example:

$ cc -o foo foo.c && ./foo
42
42
$ uname -sr
Darwin 17.5.0

But you don't need to unmap; you can just overwrite the mapping. As a bonus, this is much more portable:

$ cat foo.c
#include <sys/types.h>
#include <sys/mman.h>
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
    int *foo = mmap(NULL, getpagesize(), PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
    *foo = 42;
    printf("%d\n", *foo);
    mmap(foo, getpagesize(), PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_FIXED, -1, 0);
    printf("%d\n", *foo);
    return 0;
}
$ cc -o foo foo.c && ./foo
42
0
$

Also, I'm not actually sure you benchmarked things properly. Creating and dropping mappings can be quite expensive, and I don't think idle zeroing would help that much. Newly mmap'd pages are not actually mapped until they are used for the first time, and on Linux this means written, not read, because Linux does silly things with copy-on-write zero pages if the first access to a page is a read instead of a write. So unless you benchmark writes to the newly mmap'd pages, I suspect that neither your previous solution nor the ones I suggested here will actually be faster than just a dumb memset.
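
If you want to measure that, something along these lines - a rough sketch with arbitrary sizes and no error handling - times each clearing method together with the writes that follow it:

#include <stdio.h>
#include <string.h>
#include <time.h>
#include <sys/mman.h>

#define LEN (200 * 4096)   /* ~200 pages, as in the question */

static void by_memset(char *p)  { memset(p, 0, LEN); }
static void by_madvise(char *p) { madvise(p, LEN, MADV_DONTNEED); }
static void by_mmap(char *p)
{
    mmap(p, LEN, PROT_READ|PROT_WRITE, MAP_FIXED|MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
}

static double clear_and_write(void (*clear)(char *), char *p)
{
    struct timespec a, b;
    clock_gettime(CLOCK_MONOTONIC, &a);
    clear(p);
    /* the writes are part of the cost: they take the faults that
       repopulate the physical pages after madvise/mmap */
    for (size_t i = 0; i < LEN; i += 4096)
        p[i] = 'B';
    clock_gettime(CLOCK_MONOTONIC, &b);
    return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1000000000.0;
}

int main(void)
{
    char *p = mmap(NULL, LEN, PROT_READ|PROT_WRITE, MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
    memset(p, 'A', LEN);   /* fault everything in before measuring */

    printf("memset : %f sec\n", clear_and_write(by_memset, p));
    printf("madvise: %f sec\n", clear_and_write(by_madvise, p));
    printf("mmap   : %f sec\n", clear_and_write(by_mmap, p));
    return 0;
}

A single pass over 200 pages is too quick to time reliably, so in practice you'd repeat each case in a loop and compare the totals.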

Art
  • 19,807
  • 1
  • 34
  • 60
  • Yes, I read that madvise is not portable. The portable posix_madvise doesn't have the behaviour of clearing. – Ajay Brahmakshatriya Apr 18 '18 at 11:18
  • And I didn't know that mmap would throw away the old mapping. That seems like a better solution then. Shouldn't the second mmap fail though? Because MAP_FIXED is passed and the area is already in use? – Ajay Brahmakshatriya Apr 18 '18 at 11:19
  • If the behaviour is defined for direct mmap to work. I will just use that. Can I also change the permissions of the page in the second mmap? This way I can do away with the call to mprotect in my allocator. – Ajay Brahmakshatriya Apr 18 '18 at 11:21
  • @AjayBrahmakshatriya `mmap( ... MAP_FIXED)` overwrites whatever was there before, no errors if there are previous mappings. – Art Apr 18 '18 at 11:21
  • 2
    @AjayBrahmakshatriya It's pretty much equivalent to creating a whole new mapping and replacing whatever used to be there. So you can change permissions. But... before you do that read that last paragraph I edited in, I'm not sure if you're benchmarking the right thing. The fact that your benchmark of munmap/mmap is faster than memset smells wrong to me from what I know about memory management (and I've worked a lot on a vm system in a kernel). – Art Apr 18 '18 at 11:24
  • *nor the ones I suggested here will actually be faster than just a dumb memset.* Not only that, you can use multiple threads to do parallel `memset()` calls. You can't do parallel `mmap()` calls because they will have to be serialized since there's only one process address space. – Andrew Henle Apr 18 '18 at 11:24
  • 3
    On the other hand, the last paper I read about the cost of mapping new pages vs. zeroing/copying had a date that started with "199". On the third hand if your kernel is patched for Meltdown/Spectre I can't possibly see how taking a fault for each page could be more efficient than a properly implemented memset. On the fourth hand, you could try using `MAP_POPULATE` on Linux and that could do the trick. On the fifth hand, relying on idle loop zeroing leads to a system with positive feedback where performance degrades under high load which increases load. – Art Apr 18 '18 at 11:46
  • I will re-evaluate the benchmark for `memset` and double `mmap`. I don't really want to pre-fault using `MAP_POPULATE` because my access patterns are such that only a few of the pages would be touched in the beginning. Others might be touched very late. I will keep all the points (only so many hands) you mentioned above and test it thoroughly. – Ajay Brahmakshatriya Apr 18 '18 at 11:51
  • Can't get back to you with all the evaluations right away. Have to finish some other work right now. But thanks for the insight. – Ajay Brahmakshatriya Apr 18 '18 at 11:52
  • @AjayBrahmakshatriya I'm actually very curious, so let me know here if you get some results. – Art Apr 18 '18 at 11:53
  • @Art I ran [this](https://ideone.com/NzSo9H) just to check the time for 1. clearing -> 2. reading 3. Rewriting. And as you suspected, `memset` was the fastest (2m24s), then `mmap` (2m31s) and finally `madvise` (2m35s). This is according to what you predicted. – Ajay Brahmakshatriya Apr 18 '18 at 12:29
  • My question is this - Does Linux maintain a list of zero physical pages? Not idle zeroing __after__ mapping, but zero'd out before being mapped anywhere? If it maintains such a pool, I think that would be faster than `memset` too. – Ajay Brahmakshatriya Apr 18 '18 at 12:31
  • The above results are for the code I posted. But for the real benchmark, `memset` (45.01s) is slower than `mmap`ing again (38.35s). I think that is because in the case of `memset` I also need to call `mprotect`. This benchmark is a test case where I have implemented `mmap`, which is being used by `dlmalloc`, which in turn is used by `gcc`. I am not sure what the access patterns are after clearing the pages. – Ajay Brahmakshatriya Apr 18 '18 at 12:39
1

Note: this is not an answer, I just needed the formatting feature.


BTW: it is possible that the /dev/zero all-zeros page doesn't even exist, and that the .read() method is implemented as follows (a similar thing happens for /dev/null, which just returns the length argument):


struct file_operations {
        ...
        ssize_t (*read) (struct file *, char __user *, size_t, loff_t *);
        ...
};

/* conceptual sketch: a real driver would use clear_user() for a
   __user pointer rather than a plain memset() */
static ssize_t zero_read (struct file *not_needed_here, char __user *buff,
                          size_t len, loff_t *ignored)
{
        memset (buff, 0, len);
        return len;
}
wildplasser
  • 43,142
  • 8
  • 66
  • 109
  • I think I understand the matter of confusion in our disagreement. I want to add that in my benchmarking I am not just including the time to munmap and mmap; I am measuring time for the entire application. So like you said, the pages may actually be mapped and cleared later, on write, I agree. I am also counting that time essentially. – Ajay Brahmakshatriya Apr 18 '18 at 16:36
  • First show us some code. We don't know your program, we don't know the way you measure/benchmark. – wildplasser Apr 18 '18 at 16:46
  • [this](https://ideone.com/NzSo9H) is a small test example. My real benchmark is actually an implementation of mmap (mmap requires zeroing of the returned virtual address range), which has dlmalloc running on top of it. This dlmalloc implementation is being used by gcc (gcc from spec_cpu_2006). – Ajay Brahmakshatriya Apr 18 '18 at 16:50
  • The time is measured for the entire program run. – Ajay Brahmakshatriya Apr 18 '18 at 16:52
  • Your test program wastes too many cycles on system calls (mapping/unmapping on every cycle). I was able to reduce the time spent in system calls from over 5 seconds to below 0.5 seconds. – wildplasser Apr 18 '18 at 17:52
  • Oh, that's great. How does memset vs mmap compare for you? Which one takes more time? – Ajay Brahmakshatriya Apr 18 '18 at 18:01