I'm using the sys_brk syscall to dynamically allocate memory on the heap. I noticed that when querying the current break location I usually get a value similar to this:

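mov rax, 0x0C    ; syscall number 12 = sys_brk on x86-64 Linux
mov rdi, 0x00    ; brk(0) is an invalid request, so the kernel just returns the current break
syscall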

results in

rax   0x401000

The value is usually 512-byte aligned. So I would like to ask: are there any alignment requirements on the break value, or can we misalign it any way we want?

trincot
St.Antario
  • Traditionally, the alignment requirement is 1, as in, there is no alignment requirement. – fuz Feb 19 '18 at 17:23
  • @fuz But if we misalign the current break we run the risk of making our data-structure not power-of-2 aligned. – St.Antario Feb 19 '18 at 17:24
  • Indeed! That's up to the programmer to manage. – fuz Feb 19 '18 at 17:28
  • @fuz Thanks, understood. I thought that there was some convention for that... – St.Antario Feb 19 '18 at 17:29
  • Well yeah, there is the `malloc` abstraction to manage this sort of thing and more. – fuz Feb 19 '18 at 17:36
  • Note that memory protection only has page granularity, so the OS can only program the hardware to map whole pages into your virtual address space. IDK if anything would ever step on bytes in the same page as the break but outside of the part you "own", but you can definitely access them without faulting. It doesn't make much sense to waste CPU time making system calls to move the break in less than 4k page increments, unless you're really going for tiny code size and don't want to track anything in user-space. – Peter Cordes Feb 19 '18 at 19:45
  • Kernel code [aligns the break to page size](https://elixir.bootlin.com/linux/latest/source/mm/mmap.c#L224) (as explained by Peter). – Margaret Bloom Feb 19 '18 at 20:37
  • @PeterCordes So it seems reasonable to get the page size first and then increase the break by that value. But what is the syscall for getting the current page size? I found `sysconf(_SC_PAGESIZE)` but there is no such syscall listed [here](http://blog.rchapman.org/posts/Linux_System_Call_Table_for_x86_64/). – St.Antario Feb 19 '18 at 20:42
  • x86's page size is 4k. You don't need to query it at run-time when targeting x86. Some other architectures can choose different sizes for the non-hugepage pagesize, but x86 is fixed at 4k. – Peter Cordes Feb 19 '18 at 20:59
  • @MargaretBloom: Oh that's interesting, so you can't use `brk` to keep track of sub-page allocations at all. Repeated `brk( brk(0) - 8)` would free a whole page per iteration, not just 8 bytes, so you can't avoid storing the current break as user-space data even in the trivial case where your allocations are stack-like but too big for the actual call-stack. But anyway, I don't really see the point of `brk` for asm experiments; just use `mmap`. – Peter Cordes Feb 19 '18 at 21:08
  • @PeterCordes Not sure if `brk( brk(0) - 8)` will free page after page, the kernel doesn't seem to use the aligned values when `brk < mm->brk`. But yes, `mmap` is waaaay more convenient. – Margaret Bloom Feb 19 '18 at 21:44

1 Answer

The kernel does track the break with byte granularity. But don't use it directly for small allocations if you care at all about performance.


There was some discussion in comments about the kernel rounding the break to a page boundary, but that's not the case. The implementation of sys_brk uses this (with my comments added so it makes sense out of context):

newbrk = PAGE_ALIGN(brk);     // the syscall arg
oldbrk = PAGE_ALIGN(mm->brk); // the current break
if (oldbrk == newbrk)
    goto set_brk;      // no need to map / unmap any pages, just update mm->brk

This checks if the break moved to a different page, but eventually mm->brk = brk; sets the current break to the exact arg passed to the system call (if it's valid). If the current break was always page aligned, the kernel wouldn't need PAGE_ALIGN() on it.
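You can check the byte granularity from user space. Here is a minimal sketch, assuming Linux and libc's `syscall()` wrapper; the helper names `raw_brk` and `brk_is_byte_granular` are made up for this example. It uses the raw system call rather than libc's `brk()`/`sbrk()` because the raw call returns the resulting break, and it puts the break back afterwards so libc's own bookkeeping isn't left out of sync.

```c
#define _GNU_SOURCE
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Raw brk system call: returns the new break on success,
   or the unchanged current break if the request is refused. */
static uintptr_t raw_brk(uintptr_t addr) {
    return (uintptr_t)syscall(SYS_brk, addr);
}

/* Move the break up by `delta` bytes, read it back, then restore it.
   Returns 1 if the kernel reported exactly old + delta,
   0 if it reported something else, -1 if the move was refused. */
static int brk_is_byte_granular(uintptr_t delta) {
    uintptr_t old = raw_brk(0);      /* brk(0) just queries the current break */
    uintptr_t want = old + delta;
    if (raw_brk(want) != want)
        return -1;                   /* e.g. RLIMIT_DATA or an address collision */
    uintptr_t seen = raw_brk(0);
    raw_brk(old);                    /* put the break back where we found it */
    return seen == want;
}
```

Calling `brk_is_byte_granular(1)` or `brk_is_byte_granular(13)` should return 1 if the kernel stores the exact argument, as the code quoted above suggests.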


Of course, memory protection has at least page granularity (and maybe hugepage, if the kernel chooses to use anonymous hugepages for this mapping). So you can access memory out to the end of the page containing the break without faulting. This is why the kernel code is just checking if the break moved to a different page to skip the map / unmap logic, but still updates the actual brk.
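To see the page-granularity effect, here is a rough sketch under the same assumptions (Linux, libc's `syscall()`; `raw_brk` and `touch_past_break` are hypothetical names). It raises the break by a single byte and then writes to the last byte of the page that contains the new break, which is well past the part the program "owns".

```c
#define _GNU_SOURCE
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Raw brk system call: returns the new break on success,
   or the unchanged current break if the request is refused. */
static uintptr_t raw_brk(uintptr_t addr) {
    return (uintptr_t)syscall(SYS_brk, addr);
}

/* Raise the break by one byte, write to the last byte of the page
   containing the new break, read it back, and restore the break.
   Returns the byte read back, or -1 if the kernel refused the move. */
static int touch_past_break(void) {
    uintptr_t page = (uintptr_t)sysconf(_SC_PAGESIZE);
    uintptr_t old = raw_brk(0);
    uintptr_t want = old + 1;
    if (raw_brk(want) != want)
        return -1;
    /* Round the new break up to a page boundary, then step back one byte. */
    uintptr_t page_end = (want + page - 1) & ~(page - 1);
    volatile unsigned char *last = (volatile unsigned char *)(page_end - 1);
    *last = 0x5A;                    /* beyond the break, but in a mapped page */
    int seen = *last;
    raw_brk(old);                    /* restore the original break */
    return seen;
}
```

If the reasoning above holds, `touch_past_break()` returns `0x5A` instead of segfaulting. That it doesn't fault is a side effect of page-granular mapping, not something to rely on in a real allocator.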

AFAIK, nothing will ever use that mapped memory above the break as scratch space, so it's not like memory below the stack pointer that can be clobbered asynchronously.

brk is just a simple memory-management system built into the kernel. System calls are expensive, so if you care about performance you should keep track of things in user-space and only make a system call when you need a new page. Using sys_brk directly for tiny allocations is terrible for performance, especially in kernels with Meltdown + Spectre mitigation enabled (which makes system calls much more expensive, like tens of thousands of clock cycles + TLB and branch prediction invalidation, instead of hundreds of clock cycles).
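As an illustration of keeping track in user-space, here is a sketch of a bump allocator that only calls the kernel when it runs out of mapped space. Everything in it (`bump_alloc`, `GROW_CHUNK`, the `brk_calls` counter) is invented for this example, and it assumes 4 KiB pages as on x86-64. It also handles the alignment of each allocation itself, since the kernel imposes none. It has no free(), no overflow checks, and is not safe to mix with libc's `malloc`, which may move the break on its own.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

#define GROW_CHUNK 4096u            /* assumed page size (x86-64 base pages) */

static uintptr_t heap_next;         /* next free byte, tracked in user space */
static uintptr_t heap_end;          /* the break as we last set it */
static unsigned long brk_calls;     /* how many system calls we actually made */

static uintptr_t raw_brk(uintptr_t addr) {
    brk_calls++;
    return (uintptr_t)syscall(SYS_brk, addr);
}

/* Allocate `size` bytes aligned to `align` (must be a power of two).
   Returns NULL if the kernel refuses to move the break. */
static void *bump_alloc(size_t size, size_t align) {
    if (heap_end == 0)
        heap_next = heap_end = raw_brk(0);   /* first use: query the break */

    uintptr_t start = (heap_next + align - 1) & ~(uintptr_t)(align - 1);
    uintptr_t stop = start + size;

    if (stop > heap_end) {
        /* Out of space: grow to the next chunk boundary in one system call. */
        uintptr_t want = (stop + GROW_CHUNK - 1) & ~(uintptr_t)(GROW_CHUNK - 1);
        if (raw_brk(want) != want)
            return NULL;
        heap_end = want;
    }
    heap_next = stop;
    return (void *)start;
}
```

With this, a thousand calls like `bump_alloc(16, 16)` should cost only a handful of system calls instead of a thousand. A larger `GROW_CHUNK` trades some unused memory for even fewer calls.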

Peter Cordes