Creating an array of arrays of elements, with each array aligned on page and cache boundaries?

Question

I have an array of templated structs, called a Block:

using Block = std::array<T, SIZE>;    // SIZE is a constant

I need to allocate memory for multiple Blocks, the number only known at run-time.

Each block is used by one thread. I want to allocate the memory so that no Block or T element straddles a page or cache-line boundary.

I found Linux's getpagesize() but how can I allocate the memory to achieve the alignment?

I am unsure how to add the padding, given that Block is a std::array and the page size is only known at run-time.

This does not need to be portable; the code will only be run on Linux.

The naive way is to allocate each one in its own page with `mmap`, but if your arrays are often small, packing more than one into a page would be much better. So it doesn't work to just `alignas(4096)` on anything at the C++ level; C++ `alignas` / `alignof` doesn't support the level of sophistication you're aiming for, of not crossing a page boundary but not necessarily being aligned to start on one. — Peter Cordes, Mar 20 '23 at 00:19
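Below is a minimal sketch of the packing rule this comment describes. It assumes the arena starts page-aligned, the page size is a power of two, and a block is no larger than a page. The function name `next_fit_offset` is hypothetical. It returns the first offset at or after `offset` where a block of `block_bytes` fits without crossing a page boundary.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: first offset >= `offset` (relative to a page-aligned
// base) where `block_bytes` fit without crossing a page boundary.
// Assumes page_size is a power of two and block_bytes <= page_size.
std::size_t next_fit_offset(std::size_t offset, std::size_t block_bytes,
                            std::size_t page_size) {
    assert((page_size & (page_size - 1)) == 0 && block_bytes <= page_size);
    const std::size_t page_end = (offset & ~(page_size - 1)) + page_size;
    // Fits in the remainder of the current page: pack it here.
    if (offset + block_bytes <= page_end) return offset;
    // Otherwise skip the tail of this page and start on the next one.
    return page_end;
}
```

To pack blocks, call it repeatedly, advancing `offset` by each placed block's size. A block is never required to start on a page boundary. It only moves to the next page when it would otherwise cross one.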

score 1 · Answer 1 · answered Mar 20 '23 at 00:24

These restrictions seem like a justifiable reason to use mmap to allocate pages.

The mmap(2) man page says: "If addr is NULL, then the kernel chooses the (page-aligned) address at which to create the mapping; this is the most portable method of creating a new mapping."
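Below is a minimal Linux sketch of that guarantee, assuming a private anonymous mapping is what's wanted. It queries the page size with `sysconf(_SC_PAGESIZE)`, the POSIX counterpart to `getpagesize()`, and checks the address that `mmap` returns. The helper names `page_size`, `map_anonymous` and `is_page_aligned` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <sys/mman.h>
#include <unistd.h>

// Hypothetical helper: the system page size, queried at run-time.
std::size_t page_size() {
    const long ps = sysconf(_SC_PAGESIZE);
    assert(ps > 0);
    return static_cast<std::size_t>(ps);
}

// Hypothetical helper: map `bytes` of zero-filled, private, anonymous memory.
// With a null address hint the kernel picks the (page-aligned) address.
// Returns nullptr on failure; release with munmap(p, bytes).
void* map_anonymous(std::size_t bytes) {
    void* p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? nullptr : p;
}

// True if `p` sits exactly on a page boundary.
bool is_page_aligned(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % page_size() == 0;
}
```

Note that `mmap` reports failure with `MAP_FAILED`, not a null pointer. The wrapper translates that so callers can test against `nullptr`.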

So you get guaranteed page alignment from mmap. If a whole page per std::array is excessive, you can manage each page yourself: divide it into chunks whose size is a whole number of cache lines, so that each std::array starts on a cache-line boundary and none straddles a page boundary.
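Below is a sketch of that per-page bookkeeping. It assumes a 64-byte cache line (typical for x86-64, but check your target) and a block no larger than a page. Each chunk is `sizeof(Block)` rounded up to a whole number of cache lines. Because the chunks-per-page count is rounded down, the last chunk stays inside the page. All names here are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t kCacheLine = 64;  // assumed cache-line size

// Round n up to a multiple of `align` (a power of two).
constexpr std::size_t round_up(std::size_t n, std::size_t align) {
    return (n + align - 1) & ~(align - 1);
}

// Bytes reserved per block so every block starts on a cache-line boundary.
constexpr std::size_t chunk_size(std::size_t block_bytes) {
    return round_up(block_bytes, kCacheLine);
}

// Whole chunks that fit in one page; the division rounds down, so no
// chunk straddles the page boundary.
std::size_t chunks_per_page(std::size_t block_bytes, std::size_t page_size) {
    const std::size_t chunk = chunk_size(block_bytes);
    assert(chunk <= page_size);  // bigger blocks need whole pages of their own
    return page_size / chunk;
}
```

For example, a 140-byte block (35 floats) occupies a 192-byte chunk. Twenty-one such chunks fit in a 4096-byte page, leaving 64 bytes unused.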

Then, use placement new to construct your std::arrays in the mmap-obtained space.
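Putting these steps together gives the sketch below. Its assumptions are:

- `T = float`, `SIZE = 35` and a 64-byte cache line stand in for the real values.
- A Block fits in one page.
- `BlockArena`, `make_block_arena` and `destroy_block_arena` are hypothetical names.

Blocks are packed into `mmap`-obtained pages at cache-line-aligned offsets and constructed with placement new.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <new>
#include <vector>
#include <sys/mman.h>
#include <unistd.h>

using T = float;                  // assumed element type
constexpr std::size_t SIZE = 35;  // assumed block length
using Block = std::array<T, SIZE>;

constexpr std::size_t kCacheLine = 64;  // assumed cache-line size

struct BlockArena {
    void* base = nullptr;        // start of the mapping (page-aligned)
    std::size_t bytes = 0;       // mapping length, a whole number of pages
    std::vector<Block*> blocks;  // one constructed Block per thread
};

// Maps enough pages for `count` Blocks and constructs each one with placement
// new at a cache-line-aligned offset that never crosses a page boundary.
// On failure, returns an arena with base == nullptr and no blocks.
BlockArena make_block_arena(std::size_t count) {
    BlockArena arena;
    const std::size_t page = static_cast<std::size_t>(sysconf(_SC_PAGESIZE));
    const std::size_t chunk = (sizeof(Block) + kCacheLine - 1) / kCacheLine * kCacheLine;
    assert(count > 0 && chunk <= page);
    const std::size_t per_page = page / chunk;
    const std::size_t pages = (count + per_page - 1) / per_page;

    void* p = mmap(nullptr, pages * page, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return arena;
    arena.base = p;
    arena.bytes = pages * page;

    auto* bytes = static_cast<unsigned char*>(p);
    arena.blocks.reserve(count);
    for (std::size_t i = 0; i < count; ++i) {
        const std::size_t offset = (i / per_page) * page + (i % per_page) * chunk;
        arena.blocks.push_back(new (bytes + offset) Block{});  // zero-initialized
    }
    return arena;
}

// Destroys the Blocks and unmaps the pages.
void destroy_block_arena(BlockArena& arena) {
    for (Block* b : arena.blocks) std::destroy_at(b);
    if (arena.base != nullptr) munmap(arena.base, arena.bytes);
    arena = BlockArena{};
}
```

Thread `i` would then work on `*arena.blocks[i]`. Each Block starts on its own cache line, and chunks are rounded up to whole lines, so no two Blocks share a cache line.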

Sam Varshavchik


You still need to know the page size to know whether you can pack another array into the end of an existing page or need a new one. (It's a bit of an odd requirement; normally it's fine to span a page boundary as long as your array is aligned by 64 or whatever, so you're not doing a SIMD load that itself spans the page. But maybe they have SIMD code that needs vectors of both `a[i + 0..7]` and `a[i + 1..8]` and can efficiently get them from loads instead of shuffle instructions, or other reasons you might want unaligned SIMD loads even on an aligned array.) – Peter Cordes Mar 20 '23 at 00:29
@PeterCordes I was assuming that, just as with cache lines and false sharing, it's better if CPU cores don't access the same OS pages? – rare77 Mar 20 '23 at 00:34
@PeterCordes but how would I align each `Block` on 64 bytes? If I create an array of them, they will just be contiguous, at whatever spacing the size of the templated struct gives. I can't align the struct itself, or I could waste a lot of memory? – rare77 Mar 20 '23 at 00:35
@rare77: incorrect; cache coherency works at cache-line granularity. There's no equivalent effect for pages. The only reason to maybe want different threads' working sets to be disjoint is for dTLB density; if your data is mixed with data from other threads you'll never touch, it'll be spread over more pages. Hardware prefetch running past the end of one array could pull in cache lines that are being used by another core, and HW prefetchers on modern Intel at least work mostly within the same physical page (the L2 streamer), so it can create a small amount of false sharing. – Peter Cordes Mar 20 '23 at 01:05
@rare77: `alignas(64) Block foo` would work. Or for dynamic allocation, use an aligned allocator like `aligned_alloc` or `posix_memalign`. `std::aligned_alloc` has the ridiculous downside that the standard says the behaviour is undefined (or worse that it's actively required to fail, so implementations couldn't define the behaviour) if the total size isn't a multiple of the alignment, so you can't for example portably use it to align an array of 31 floats. See [How to solve the 32-byte-alignment issue for AVX load/store operations?](https://stackoverflow.com/q/32612190) – Peter Cordes Mar 20 '23 at 01:08
@rare77: I think / hope that at least glibc / libstdc++ `std::aligned_alloc` on Linux ignores that braindead requirement. Whoever put that in the standard seems to have forgotten about obvious SIMD use-cases. – Peter Cordes Mar 20 '23 at 01:10
@PeterCordes regarding `alignas(64) foo`, do you mean the `std::array` will be aligned on a 64-byte boundary AND each foo element will still be contiguous? That is, if `sizeof(foo)` = 14, I won't get 50 bytes of padding after every element (which I am trying to avoid)? I would just like the start of each `Block` to be aligned, not every single `foo` element. – rare77 Mar 20 '23 at 01:24
@rare77: Right, the `Block` (`std::array`) object will be aligned by 64, unlike if you'd applied `alignas(64)` to the definition of `T`. – Peter Cordes Mar 20 '23 at 07:31
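The sketch below illustrates the distinction discussed in these comments. It assumes a hypothetical 14-byte element `Foo` and a 64-byte alignment target.

- `alignas(64)` on the `Block` object aligns only its start, so the elements stay packed.
- Putting `alignas(64)` on the element type pads every element out to 64 bytes.

For a run-time-sized array of Blocks, one option not spelled out in the comments is an over-aligned wrapper struct allocated with `posix_memalign`. The names `PaddedBlock` and `alloc_padded_blocks` are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>
#include <stdlib.h>  // posix_memalign, free

struct Foo { char bytes[14]; };  // hypothetical 14-byte element
using Block = std::array<Foo, 8>;

// alignas on the Block object: its start is 64-byte aligned, elements stay packed.
alignas(64) Block g_block{};
// Holds for libstdc++'s std::array, which has no padding beyond its element array.
static_assert(sizeof(Block) == 8 * sizeof(Foo), "no per-element padding");

// alignas on the element type instead: every element is padded out to 64 bytes.
struct alignas(64) AlignedFoo { char bytes[14]; };
static_assert(sizeof(AlignedFoo) == 64, "50 bytes of padding per element");

// Dynamic array of Blocks, each starting on a 64-byte boundary: wrap Block in an
// over-aligned struct so its size rounds up to a multiple of 64 (112 -> 128 here).
struct alignas(64) PaddedBlock { Block data; };

// Allocates `count` (> 0) PaddedBlocks with posix_memalign; returns nullptr on
// failure. Release with free() (all types here are trivially destructible).
PaddedBlock* alloc_padded_blocks(std::size_t count) {
    assert(count > 0);
    void* p = nullptr;
    if (posix_memalign(&p, alignof(PaddedBlock), count * sizeof(PaddedBlock)) != 0)
        return nullptr;
    auto* blocks = static_cast<PaddedBlock*>(p);
    for (std::size_t i = 0; i < count; ++i) new (&blocks[i]) PaddedBlock{};
    return blocks;
}
```

The padding cost is paid once per Block (here 16 bytes), not once per element.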
