Many of the comments are correct, but I thought I would try my hand at a full answer.
First, using malloc will not give you explicit control over page mappings. As one comment said, the malloc in stdlib will typically request a large chunk of memory from the kernel on the first allocation and serve later allocations from it.
Second, a new thread uses the same address space as the thread that created it, so creating one does not give you a separate set of mappings.
I'm going to assume you want to do this from user space, because from kernel space, you can do a lot of things to make this exploration somewhat degenerate (for example you can just try and map pages to the same location).
Instead you want to allocate anonymous pages using mmap.
mmap is an explicit call that creates a virtual memory area (VMA), so that when a page in it is accessed for the first time, the kernel can back that location with zeroed physical memory.
It is the first access to that location that causes the page fault, and it is that first access that actually takes the locks at the PTE and PUD levels.
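To make the "first access causes the fault" point concrete, here is a minimal sketch that maps one anonymous page and counts the minor faults reported by getrusage around the first write. The function name first_touch_minor_faults and the hard-coded 4096-byte page size are my own assumptions for illustration, and the exact count you see will depend on your system.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/resource.h>

/* Hypothetical helper: returns the number of minor page faults observed
 * while touching a freshly mapped anonymous page for the first time,
 * or -1 if a call fails. */
long first_touch_minor_faults(void)
{
    const size_t page = 4096; /* assumed page size */
    struct rusage before, after;

    /* Creating the mapping only records it; no physical page is attached yet. */
    volatile uint64_t *p = mmap(NULL, page, PROT_READ | PROT_WRITE,
                                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    if (getrusage(RUSAGE_SELF, &before) != 0) {
        munmap((void *)p, page);
        return -1;
    }
    *p = 1; /* first access: this is where the kernel has to supply a page */
    if (getrusage(RUSAGE_SELF, &after) != 0) {
        munmap((void *)p, page);
        return -1;
    }

    int res = munmap((void *)p, page);
    assert(res == 0);
    return after.ru_minflt - before.ru_minflt;
}
```

On an ordinary Linux machine I would expect the result to be at least 1, but treat the exact number as system-dependent.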
Ensuring Good Benchmarking Procedure:
- If you are just trying to stress the page tables you might also want to turn off Transparent Huge Pages within that process. (The syscall to look into is prctl with the option PR_SET_THP_DISABLE.) Run this before spawning any child processes.
- Pin threads to cores using cpuset.
- You want explicit control over your region of interest, so pick specific addresses for each thread that all fall under the same page table. This way the threads contend for the same page-table locks as much as possible.
- Write to the location using a pseudo-random function that has a different seed for each thread.
- Compare with a baseline that does exactly the same thing but stresses widely separated parts of the address space.
- Make sure that as little as possible differs between the baseline and the heavily contended workload.
- Do not oversubscribe the processor; doing so adds overhead from context switches, which is notoriously hard to root out.
- Make sure to start capturing timing after the threads are created and stop it before they are destroyed.
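Several of the points above can be sketched as small helpers. This is only an illustration under my own assumptions: the helper names are hypothetical, I pin with sched_setaffinity rather than a cpuset because it is easier to show in a few lines, and xorshift64 is just one possible choice of per-thread pseudo-random function.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <stdint.h>
#include <sys/prctl.h>
#include <time.h>

/* Disables Transparent Huge Pages for the calling process; the setting is
 * inherited by children created afterwards. Returns 0 on success, -1 on failure. */
int disable_thp(void)
{
    return prctl(PR_SET_THP_DISABLE, 1UL, 0UL, 0UL, 0UL);
}

/* Returns the lowest CPU the calling thread may run on, or -1 on failure. */
int first_allowed_cpu(void)
{
    cpu_set_t set;
    if (sched_getaffinity(0, sizeof(set), &set) != 0)
        return -1;
    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++)
        if (CPU_ISSET(cpu, &set))
            return cpu;
    return -1;
}

/* Pins the calling thread to one CPU. Returns 0 on success, -1 on failure. */
int pin_self_to_cpu(int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof(set), &set);
}

/* Monotonic timestamp in nanoseconds, for timing only the region of interest. */
uint64_t now_ns(void)
{
    struct timespec ts;
    int res = clock_gettime(CLOCK_MONOTONIC, &ts);
    assert(res == 0);
    return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
}

/* One xorshift64 step; give each thread's state a distinct nonzero seed. */
uint64_t xorshift64(uint64_t *state)
{
    uint64_t x = *state;
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    return *state = x;
}
```

The intended order would be: call disable_thp() once in the main thread, have each worker call pin_self_to_cpu() with its own CPU, and wrap only the measured loop between two now_ns() calls.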
What this translates to in each thread:
address = <per-thread address>
total = 0;
for (int i = 0; i < N; i++)
{
    // MAP_PRIVATE (or MAP_SHARED) is required; without MAP_FIXED the address is only a hint
    uint64_t* x = (uint64_t*) mmap((void*) address, 4096, PROT_READ | PROT_WRITE,
                                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); // Maps one page anonymously
    assert(x != MAP_FAILED); // mmap reports failure as MAP_FAILED, not NULL
    *x ^= pseudo_rand(); // Accesses the page and causes the allocation
    total += *x; // For fun
    int res = munmap((void*) x, 4096); // Deallocates the page (similar locks)
    assert(!res);
}
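For anyone who wants something they can compile, here is a self-contained version of that loop as a function. The name map_touch_unmap, the PAGE_BYTES constant, and the small LCG standing in for pseudo_rand() are my own assumptions, and the address is passed as a plain hint, so the kernel is free to place the page elsewhere if the hint is unusable.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdint.h>
#include <sys/mman.h>

#define PAGE_BYTES 4096 /* assumed page size; sysconf(_SC_PAGESIZE) is safer */

/* Maps, touches, and unmaps one anonymous page n times at the hinted address.
 * Returns the accumulated value so the accesses cannot be optimized away. */
uint64_t map_touch_unmap(uint64_t hint, int n, uint64_t seed)
{
    uint64_t total = 0;
    for (int i = 0; i < n; i++) {
        volatile uint64_t *x = mmap((void *)(uintptr_t)hint, PAGE_BYTES,
                                    PROT_READ | PROT_WRITE,
                                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        assert(x != MAP_FAILED);

        /* Simple LCG step, a stand-in for the per-thread pseudo_rand(). */
        seed = seed * 6364136223846793005ULL + 1442695040888963407ULL;

        *x ^= seed;  /* first access faults in a zeroed page */
        total += *x;

        int res = munmap((void *)x, PAGE_BYTES); /* tears the mapping down again */
        assert(res == 0);
    }
    return total;
}
```

Each worker thread would call this with its own address and seed, for example map_touch_unmap(my_address, N, thread_id + 1).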
The big takeaways are:
- Use mmap and explicitly access the allocated location to actually control individual page allocation.
- The compactness of addresses determines what locks are acquired.
- Measuring kernel and virtual memory behavior requires strict discipline in benchmark procedure.
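To illustrate what "compactness of addresses" means in numbers, here is a small sketch of how per-thread addresses could be laid out. It assumes x86-64 with 4 KiB pages and 512-entry tables, so one PTE table covers 2 MiB and one PMD table covers 1 GiB; the helper names and the requirement that base be 2 MiB aligned are my own assumptions, and other architectures or configurations will have different spans.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_BYTES     4096ULL                   /* assumed 4 KiB base pages */
#define PTE_TABLE_SPAN (512ULL * PAGE_BYTES)     /* 2 MiB covered by one PTE table */
#define PMD_TABLE_SPAN (512ULL * PTE_TABLE_SPAN) /* 1 GiB covered by one PMD table */

/* Contended layout: consecutive pages, all under the same PTE table
 * (base must be 2 MiB aligned and tid below 512). */
uint64_t contended_addr(uint64_t base, unsigned tid)
{
    assert(tid < 512);
    return base + tid * PAGE_BYTES;
}

/* Baseline layout: threads 1 GiB apart, so no two share a PTE or PMD table. */
uint64_t spread_addr(uint64_t base, unsigned tid)
{
    return base + tid * PMD_TABLE_SPAN;
}

/* True if both addresses are mapped through the same PTE table. */
int same_pte_table(uint64_t a, uint64_t b)
{
    return a / PTE_TABLE_SPAN == b / PTE_TABLE_SPAN;
}
```

With a base such as 0x100000000000, contended_addr gives every thread a page in the same 2 MiB window, while spread_addr gives the baseline run the same work in windows that do not overlap.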