memset() appears to be about an order of magnitude faster than mmap() at producing a new zero-filled page, at least on the Solaris 11 server I have access to right now. I strongly suspect that Linux will produce similar results.

I wrote a small benchmark program:
#include <stdio.h>
#include <sys/mman.h>
#include <string.h>
#include <strings.h>

#include <sys/time.h>

#define NUM_BLOCKS ( 512 * 1024 )
#define BLOCKSIZE  ( 4 * 1024 )

int main( int argc, char **argv )
{
    int ii;

    char *blocks[ NUM_BLOCKS ];

    hrtime_t start = gethrtime();

    for ( ii = 0; ii < NUM_BLOCKS; ii++ )
    {
        blocks[ ii ] = mmap( NULL, BLOCKSIZE,
            PROT_READ | PROT_WRITE,
            MAP_ANONYMOUS | MAP_PRIVATE, -1, 0 );
        // force the creation of the mapping
        blocks[ ii ][ ii % BLOCKSIZE ] = ii;
    }

    printf( "setup time: %lf sec\n",
        ( gethrtime() - start ) / 1000000000.0 );

    for ( int jj = 0; jj < 4; jj++ )
    {
        start = gethrtime();

        for ( ii = 0; ii < NUM_BLOCKS; ii++ )
        {
            blocks[ ii ] = mmap( blocks[ ii ],
                BLOCKSIZE, PROT_READ | PROT_WRITE,
                MAP_FIXED | MAP_ANONYMOUS | MAP_PRIVATE, -1, 0 );
            blocks[ ii ][ ii % BLOCKSIZE ] = 0;
        }

        printf( "mmap() time: %lf sec\n",
            ( gethrtime() - start ) / 1000000000.0 );
        start = gethrtime();

        for ( ii = 0; ii < NUM_BLOCKS; ii++ )
        {
            memset( blocks[ ii ], 0, BLOCKSIZE );
        }

        printf( "memset() time: %lf sec\n",
            ( gethrtime() - start ) / 1000000000.0 );
    }

    return( 0 );
}
Note that writing a single byte anywhere in the page is all that's needed to force the creation of the physical page.
I ran it on my Solaris 11 file server (the only POSIX-style system I have running on bare metal right now). I didn't test madvise() on my Solaris system because Solaris, unlike Linux, doesn't guarantee that the mapping will be repopulated with zero-filled pages, only that "the system starts to free the resources".
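For completeness, this is roughly what the madvise() variant of the timing loop would look like on Linux. It's only a sketch - I haven't run it - and it reuses the blocks, start, and ii variables from the benchmark above; on Linux the gethrtime() calls would also have to be replaced with something like a clock_gettime( CLOCK_MONOTONIC ) wrapper:

// Sketch only: on Linux, MADV_DONTNEED on a private anonymous mapping
// causes the next access to fault in a fresh zero-filled page, so it can
// stand in for the MAP_FIXED mmap() call in the benchmark's timing loop.
start = gethrtime();
for ( ii = 0; ii < NUM_BLOCKS; ii++ )
{
    madvise( blocks[ ii ], BLOCKSIZE, MADV_DONTNEED );
    blocks[ ii ][ ii % BLOCKSIZE ] = 0;   // fault a zero page back in
}
printf( "madvise() time: %lf sec\n",
    ( gethrtime() - start ) / 1000000000.0 );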
The results of the mmap()/memset() benchmark:
setup time: 11.144852 sec
mmap() time: 15.159650 sec
memset() time: 1.817739 sec
mmap() time: 15.029283 sec
memset() time: 1.788925 sec
mmap() time: 15.083473 sec
memset() time: 1.780283 sec
mmap() time: 15.201085 sec
memset() time: 1.771827 sec
memset() is almost an order of magnitude faster. When I get a chance, I'll rerun that benchmark on Linux, but it'll likely have to be on a VM (AWS etc.). That's not surprising - mmap() is expensive, and the kernel still needs to zero the pages at some point.
Interestingly, commenting out one line
for ( ii = 0; ii < NUM_BLOCKS; ii++ )
{
    blocks[ ii ] = mmap( blocks[ ii ],
        BLOCKSIZE, PROT_READ | PROT_WRITE,
        MAP_FIXED | MAP_ANONYMOUS | MAP_PRIVATE, -1, 0 );
    //blocks[ ii ][ ii % BLOCKSIZE ] = 0;
}
produces these results:
setup time: 10.962788 sec
mmap() time: 7.524939 sec
memset() time: 10.418480 sec
mmap() time: 7.512086 sec
memset() time: 10.406675 sec
mmap() time: 7.457512 sec
memset() time: 10.296231 sec
mmap() time: 7.420942 sec
memset() time: 10.414861 sec
The burden of forcing the creation of the physical mapping has shifted to the memset() call, leaving only the implicit munmap() in the test loops, where each existing mapping is destroyed when the MAP_FIXED mmap() call replaces it. Note that just the munmap() takes about 3-4 times longer than keeping the pages in the address space and memset()'ing them to zero.
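To spell out what "implicit munmap()" means here: mapping over an existing mapping with MAP_FIXED behaves as if the old mapping had first been munmap()'d and a new one created in its place, all in one call. Written out as two explicit steps it would look something like the sketch below - remap_block() is just a name I made up for illustration, and without MAP_FIXED the address is only a hint, so the kernel isn't obliged to put the new mapping back at the same place:

#include <sys/mman.h>

/* Two-step equivalent of the MAP_FIXED replacement in the benchmark loop.
 * Tearing down the old mapping and building a new one is exactly the work
 * the MAP_FIXED call does implicitly (and atomically). */
static char *remap_block( char *old, size_t len )
{
    munmap( old, len );                        /* destroy the old mapping  */
    return mmap( old, len,                     /* ask for the same address */
                 PROT_READ | PROT_WRITE,
                 MAP_ANONYMOUS | MAP_PRIVATE, -1, 0 );
}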
The cost of mmap() isn't really the mmap()/munmap() system calls themselves; it's that the new page requires a lot of behind-the-scenes CPU cycles to create the actual physical mapping, and that work doesn't happen in the mmap() system call itself - it happens afterwards, when the process first accesses the memory page.
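You can see the same thing by counting minor page faults around the mmap() call and around the first touch of the memory. This is a separate toy program (my addition, not part of the benchmark above); getrusage() and ru_minflt exist on both Solaris and Linux, and the fault count should jump at the memset(), not at the mmap():

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/resource.h>

#define REGION_SIZE ( 64 * 1024 * 1024 )

static long minflt( void )
{
    struct rusage ru;
    getrusage( RUSAGE_SELF, &ru );
    return ru.ru_minflt;
}

int main( void )
{
    long before = minflt();

    char *p = mmap( NULL, REGION_SIZE, PROT_READ | PROT_WRITE,
                    MAP_ANONYMOUS | MAP_PRIVATE, -1, 0 );
    if ( p == MAP_FAILED )
        return 1;

    long after_map = minflt();

    // first touch - this is where the physical pages actually get created
    memset( p, 0, REGION_SIZE );

    long after_touch = minflt();

    printf( "minor faults during mmap():      %ld\n", after_map - before );
    printf( "minor faults during first touch: %ld\n", after_touch - after_map );
    return 0;
}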
If you doubt the results, note this LKML post from Linus Torvalds himself:
...
HOWEVER, playing games with the virtual memory mapping is very
expensive in itself. It has a number of quite real disadvantages that
people tend to ignore because memory copying is seen as something very
slow, and sometimes optimizing that copy away is seen as an obvious
improvment.
Downsides to mmap:
- quite noticeable setup and teardown costs. And I mean noticeable. It's things like following the page tables to unmap everything cleanly. It's the book-keeping for maintaining a list of all the mappings. It's the TLB flush needed after unmapping stuff.
- ...
Profiling the code using Solaris Studio's collect and analyzer tools produced the following output:
Source File: mm.c
Inclusive Inclusive Inclusive
Total CPU Time Sync Wait Time Sync Wait Count Name
sec. sec.
1. #include <stdio.h>
2. #include <sys/mman.h>
3. #include <string.h>
4. #include <strings.h>
5.
6. #include <sys/time.h>
7.
8. #define NUM_BLOCKS ( 512 * 1024 )
9. #define BLOCKSIZE ( 4 * 1024 )
10.
11. int main( int argc, char **argv )
<Function: main>
0.011 0. 0 12. {
13. int ii;
14.
15. char *blocks[ NUM_BLOCKS ];
16.
0. 0. 0 17. hrtime_t start = gethrtime();
18.
0.129 0. 0 19. for ( ii = 0; ii < NUM_BLOCKS; ii++ )
20. {
21. blocks[ ii ] = mmap( NULL, BLOCKSIZE,
22. PROT_READ | PROT_WRITE,
3.874 0. 0 23. MAP_ANONYMOUS | MAP_PRIVATE, -1, 0 );
24. // force the creation of the mapping
7.928 0. 0 25. blocks[ ii ][ ii % BLOCKSIZE ] = ii;
26. }
27.
28. printf( "setup time: %lf sec\n",
0. 0. 0 29. ( gethrtime() - start ) / 1000000000.0 );
30.
0. 0. 0 31. for ( int jj = 0; jj < 4; jj++ )
32. {
0. 0. 0 33. start = gethrtime();
34.
0.560 0. 0 35. for ( ii = 0; ii < NUM_BLOCKS; ii++ )
36. {
37. blocks[ ii ] = mmap( blocks[ ii ],
38. BLOCKSIZE, PROT_READ | PROT_WRITE,
33.432 0. 0 39. MAP_FIXED | MAP_ANONYMOUS | MAP_PRIVATE, -1, 0 );
29.535 0. 0 40. blocks[ ii ][ ii % BLOCKSIZE ] = 0;
41. }
42.
43. printf( "mmap() time: %lf sec\n",
0. 0. 0 44. ( gethrtime() - start ) / 1000000000.0 );
0. 0. 0 45. start = gethrtime();
46.
0.101 0. 0 47. for ( ii = 0; ii < NUM_BLOCKS; ii++ )
48. {
7.362 0. 0 49. memset( blocks[ ii ], 0, BLOCKSIZE );
50. }
51.
52. printf( "memset() time: %lf sec\n",
0. 0. 0 53. ( gethrtime() - start ) / 1000000000.0 );
54. }
55.
0. 0. 0 56. return( 0 );
0. 0. 0 57. }
Compile flags: /opt/SUNWspro/bin/cc -g -m64 mm.c -W0,-xp.XAAjaAFbs71a00k.
Note the large amount of time spent in mmap(), and also in the setting of a single byte in each newly-mapped page.
This is an overview from the analyzer tool. Note the large amount of system time:

The large amount of system time consumed is the time taken to map and unmap the physical pages.
This timeline shows when all that time was consumed:

The light green is system time - that's all in the mmap() loops. You can see it switch over to the dark-green user time when the memset() loops run. I've highlighted one of those instances so you can see what's going on at that point.
Updated results from a Linux VM:
setup time: 2.567396 sec
mmap() time: 2.971756 sec
memset() time: 0.654947 sec
mmap() time: 3.149629 sec
memset() time: 0.658858 sec
mmap() time: 2.800389 sec
memset() time: 0.647367 sec
mmap() time: 2.915774 sec
memset() time: 0.646539 sec
This tracks exactly with what I stated in my comment yesterday: "FWIW, a quick test I ran showed that a simple, single-threaded call to memset() is somewhere between five and ten times faster than redoing mmap()."

I simply do not understand this fascination with mmap(). mmap() is one hugely expensive call, and it's a forced single-threaded operation - there's only one set of physical memory on the machine. mmap() is not only S-L-O-W, it impacts both the entire process address space and the VM system on the entire host.
Using any form of mmap() just to zero out memory pages is counterproductive. First, the pages don't get zeroed for free - something has to memset() them to clear them. It just doesn't make any sense to add tearing down and recreating a memory mapping on top of that memset() just to clear a page of RAM.

memset() also has the advantage that more than one thread can be clearing memory at any one time; making changes to memory mappings is a single-threaded process.
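For what it's worth, here's a rough sketch of what multi-threaded clearing could look like. The thread count, the zero_job/zero_worker/zero_all_blocks names, and the way the block array gets split up are my own assumptions for illustration, not something from the benchmark above:

#include <pthread.h>
#include <string.h>

#define NUM_ZERO_THREADS 4     /* arbitrary - pick something <= core count */

struct zero_job
{
    char   **blocks;           /* array of page-sized blocks to clear */
    int      first;            /* index of this thread's first block  */
    int      count;            /* number of blocks for this thread    */
    size_t   blocksize;
};

static void *zero_worker( void *arg )
{
    struct zero_job *job = arg;

    for ( int ii = job->first; ii < job->first + job->count; ii++ )
    {
        memset( job->blocks[ ii ], 0, job->blocksize );
    }
    return NULL;
}

/* Clear every block, splitting the array into contiguous slices and
 * handing one slice to each thread. */
static void zero_all_blocks( char **blocks, int nblocks, size_t blocksize )
{
    pthread_t       tids[ NUM_ZERO_THREADS ];
    struct zero_job jobs[ NUM_ZERO_THREADS ];
    int             per = nblocks / NUM_ZERO_THREADS;

    for ( int tt = 0; tt < NUM_ZERO_THREADS; tt++ )
    {
        jobs[ tt ].blocks    = blocks;
        jobs[ tt ].first     = tt * per;
        jobs[ tt ].count     = ( tt == NUM_ZERO_THREADS - 1 ) ?
                               ( nblocks - tt * per ) : per;
        jobs[ tt ].blocksize = blocksize;
        pthread_create( &tids[ tt ], NULL, zero_worker, &jobs[ tt ] );
    }
    for ( int tt = 0; tt < NUM_ZERO_THREADS; tt++ )
    {
        pthread_join( tids[ tt ], NULL );
    }
}

In the benchmark above, this would replace the plain memset() loop with something like zero_all_blocks( blocks, NUM_BLOCKS, BLOCKSIZE ).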