1

There are locks at the PMD and PTE levels of the page table in the Linux kernel (the split page table locks). Each time a thread/process allocates or maps memory, it must hold one of these locks to update the page table accordingly. As the number of threads increases, contention for these locks increases as well. This may degrade memory-mapping throughput, since many threads end up contending for the same spinlock.

What I'd like to measure for a task is the worst-case overhead of these locks on memory-mapping throughput, but I have no idea how to measure it.

I tried calling malloc in an infinite loop while increasing the number of threads running the same loop. I check /proc/{pid}/maps for each set of running threads to count the number of mapped regions. But I'm not sure this is the correct way, and this method also consumes a lot of memory.
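
Roughly, what I tried looks like this (a simplified sketch; the allocation size and the count_mappings helper are just placeholders for illustration):

#include <stdio.h>
#include <stdlib.h>

/* Thread body passed to pthread_create: each worker just allocates in a loop. */
static void *alloc_loop(void *arg)
{
    (void) arg;
    for (;;) {
        char *p = malloc(1 << 20);   /* arbitrary size */
        if (!p)
            break;
        p[0] = 1;                    /* touch it so it is actually backed */
    }
    return NULL;
}

/* Meanwhile I count the lines of /proc/self/maps (one line per mapping). */
static long count_mappings(void)
{
    FILE *f = fopen("/proc/self/maps", "r");
    char *line = NULL;
    size_t len = 0;
    long n = 0;
    if (!f)
        return -1;
    while (getline(&line, &len, f) != -1)
        n++;
    free(line);
    fclose(f);
    return n;
}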

Is there any efficient way to measure the worst-case overhead of these locks?

Mohammad Siavashi
  • You make the fundamental assumption that malloc gets that far. Typically the C runtime will grab a big chunk of memory for the heap in one go (to avoid the cost of obtaining more memory from the OS) and simply doles it out itself. – pm100 Mar 31 '22 at 20:54
  • This would have to be instrumented in the kernel; you can't do it in user code. – Barmar Mar 31 '22 at 20:54
  • @pm100 Is `mmap` useful in this case? – Mohammad Siavashi Mar 31 '22 at 20:56
  • @Barmar I had the idea of writing a kernel module that counts the `vm_area_struct`s and calling `mmap` in my userspace process. But isn't that equivalent to `/proc/{pid}/maps`? I'm a bit confused in this area. – Mohammad Siavashi Mar 31 '22 at 20:58
  • How does counting them measure the time waiting for locks? – Barmar Mar 31 '22 at 20:59
  • @Barmar I need the throughput, not the latency. What I want to measure is "Does the number of allocations decrease when the number of threads that modifies the page table increases?". At least it counts the number of successfully mapped regions. Although my assumption that there is a linear relationship between the page table and the `vm_area_struct` count might be wrong. – Mohammad Siavashi Mar 31 '22 at 21:02
  • [unix.se] might be a better place to ask this. – Barmar Mar 31 '22 at 21:05
  • "Does the number of allocations decrease when the number of threads that modifies the page table increases?" No, why would it do that? All allocations will occur (or fail) according to available memory, not the number of threads. Do you mean *per second*? or per some other interval? I'm also not clear what exactly you mean by 'worst-case overhead'. The overhead depends on the number of threads: the higher the number the worse the case, but there is no 'worst'. Unclear what you're asking. – user207421 Mar 31 '22 at 23:51
  • @user207421 throughput is always reported over a given interval, usually per second. It is a well-accepted fact that throughput decreases when contention for a spinlock increases in multi-threaded applications. This should be no exception. There are many solutions that implement a lock-free design, such as nrOS. Although their implementation aims at a replicated page table, the problem they solve is still the same. The worst case usually happens when all threads do nothing but contend for the lock repeatedly. I assume it couldn't be worse than this scenario. – Mohammad Siavashi Apr 01 '22 at 06:22

1 Answer

5

A lot of the comments are correct; however, I thought I might try my hand at a full response. Firstly, using malloc will not give you explicit control over page mappings: as a comment noted, the malloc implementation in the C library will typically grab a huge chunk of memory from the kernel after the first allocation and serve subsequent allocations from it. Secondly, a new thread shares the same address space, so creating threads will not by itself create additional mappings.

I'm going to assume you want to do this from user space, because from kernel space you can do a lot of things that make this exploration somewhat degenerate (for example you can just try and map pages to the same location). Instead, you want to allocate anonymous pages using mmap. mmap is an explicit call to create a virtual memory area (a vm_area_struct), so that when that particular page is accessed for the first time, the kernel can actually put some blank physical memory at that location. It is the first access to that location that causes the fault, and it is that first access which actually takes the locks at the PTE and PMD levels.
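
As a quick sanity check of that first-touch behaviour, here is a minimal sketch (not the benchmark itself; mincore is just one convenient way to observe whether the page is resident yet):

#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    size_t page = (size_t) sysconf(_SC_PAGESIZE);
    unsigned char vec;

    /* mmap only creates the VMA; no physical page is allocated yet. */
    char *p = mmap(NULL, page, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return 1;

    mincore(p, page, &vec);
    printf("resident before touch: %d\n", vec & 1);   /* typically 0 */

    p[0] = 42;   /* first write faults the page in; this is where the PTE lock is taken */

    mincore(p, page, &vec);
    printf("resident after touch:  %d\n", vec & 1);   /* 1 */

    munmap(p, page);
    return 0;
}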

Ensuring Good Benchmarking Procedure:

  1. If you are just trying to stress the page tables, you might also want to turn off Transparent Huge Pages within that process (the syscall to look into is prctl with PR_SET_THP_DISABLE). Run this before spawning any threads; see the setup sketch after this list.
  2. Pin threads to cores using cpusets or thread affinity (also shown in the setup sketch below).
  3. You want to explicitly control your region of interest, so you want to pick specific addresses for each thread that all share the same page table. This way you ensure that the maximum number of locks is used.
  4. Use a pseudo-random function, with a different seed for each thread, to write to the location.
  5. Compare with a baseline that does the exact same thing but stresses widely separated parts of the address space.
  6. Make sure that as little as possible differs between the baseline and the contended workload.
  7. Do not over-subscribe the processor; otherwise the measured overhead will be dominated by context switches, which are notoriously hard to root out.
  8. Make sure to start capturing timing after the threads are created and stop it before they are destroyed.
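
For points 1 and 2, the setup might look something like this (a minimal sketch assuming glibc; PR_SET_THP_DISABLE is the prctl flag, and pthread_setaffinity_np is one way to pin, cpusets via cgroups work as well):

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <sys/prctl.h>

/* Run once in the parent before spawning workers: disable THP for this
   process so every fault goes through the regular 4 KiB path. */
static int disable_thp(void)
{
    return prctl(PR_SET_THP_DISABLE, 1, 0, 0, 0);
}

/* Pin the calling thread to a single core. */
static int pin_to_core(int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}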

What this translates to in each thread:

#include <sys/mman.h>
#include <stdint.h>
#include <assert.h>

void *address = /* per-thread address */;
uint64_t total = 0;
for (int i = 0; i < N; i++)
{
    /* Map one page anonymously; 'address' is only a hint unless MAP_FIXED is used */
    uint64_t *x = (uint64_t *) mmap(address, 4096, PROT_READ | PROT_WRITE,
                                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    assert(x != MAP_FAILED);   /* mmap returns MAP_FAILED, not NULL, on error */
    *x ^= pseudo_rand();       /* First write faults the page in and takes the locks */
    total += *x;               /* For fun */
    int res = munmap(x, 4096); /* Deallocates the page (similar locks) */
    assert(!res);
}
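
Around that per-thread loop, a timing harness could be a skeleton like the following (a sketch only; worker is assumed to pin itself and run the loop above, and NTHREADS/N are placeholders you would tune):

#include <pthread.h>
#include <stdio.h>
#include <time.h>

#define NTHREADS 8          /* keep this <= the number of cores (point 7) */
#define N        100000     /* iterations of the mmap/munmap loop per thread */

static pthread_barrier_t start_barrier;

static void *worker(void *arg)
{
    (void) arg;
    /* pin_to_core(...) would go here */
    pthread_barrier_wait(&start_barrier);
    /* ... run the N-iteration mmap/touch/munmap loop from above ... */
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];
    struct timespec t0, t1;

    pthread_barrier_init(&start_barrier, NULL, NTHREADS + 1);
    for (long t = 0; t < NTHREADS; t++)
        pthread_create(&tid[t], NULL, worker, (void *) t);

    pthread_barrier_wait(&start_barrier);   /* release the workers... */
    clock_gettime(CLOCK_MONOTONIC, &t0);    /* ...then start the clock (point 8) */

    for (int t = 0; t < NTHREADS; t++)
        pthread_join(tid[t], NULL);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("%.0f map+fault+unmap operations per second\n", (double) NTHREADS * N / secs);
    return 0;
}

Compile with -pthread. Comparing the printed rate between the contended layout and the spread-out baseline gives the throughput difference you are after.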

The big takeaways are:

  1. Use mmap and explicitly access the allocated location to actually control individual page allocation.
  2. The compactness of addresses determines what locks are acquired.
  3. Measuring kernel and virtual memory things requires strict discipline in benchmark procedure.
tewaro
  • Thanks for your detailed answer. I tried the same solution but I think I missed accessing the memory to actually allocate the physical pages. I'm going to try that. Thanks – Mohammad Siavashi Apr 01 '22 at 06:25
  • I noted this sentence in your answer: "for example, you can just try and map pages to the same location". Could you please elaborate on this a little more? How can I do this mapping to the same location? Does it modify the page table? Can I do it by writing a kernel module? I'd select this as the best answer if you could add tips on how to do this in the kernel and what the benefits are. Thanks. @tewaro – Mohammad Siavashi Apr 03 '22 at 11:41
  • Doing it as a kernel module is going to be quite similar, minus the overhead of switching into the kernel, but if you are trying to understand the overhead from those locks it is better procedure to do it in user space, because you want to know how much the impact differs from normal calls. There are kernel versions of the mmap calls, and a similar loop should work. You can also make mapping procedures idempotent, or even ones that can overwrite each other without checks. Then you can stress the locks a ton, but it seems sort of a moot point, because the workload isn't paging. – tewaro Apr 03 '22 at 15:41
  • I will note that calling `mmap` like that repeatedly will not allocate separate `vm_area_struct`s, but will extend the existing one. This is the reason why `munmap` [might fail](https://stackoverflow.com/questions/43743555/munmap-failure-with-enomem-with-private-anonymous-mapping) with `ENOMEM`: deallocation can cause a split of a `vm_area_struct`, which can exceed `vm.max_map_count`. It might not affect locking, but it can be useful to know for benchmarking. –  Apr 09 '22 at 13:33
  • I don't think this will be too much of a problem, as there should only be as many threads as there are cores. So only on something like a 256-core machine with the cap set quite low would you encounter this issue. To further get rid of the variance from struct merging, you can also make the baseline mappings do similar merging by setting up mappings around each of the pointers you will map. But I don't think it will be as much of an overhead as the locks themselves. – tewaro Apr 10 '22 at 19:02