Custom allocator and memory alignment

Question

I'm trying to implement custom allocator to work with std containers based on the requirements here: https://en.cppreference.com/w/cpp/named_req/Allocator

I'm currently trying to implement a linear allocator and I'm having hard time with memory alignment.
After I allocate a block of memory I'm wondering how much padding do I need between each object in the block to optimize cpu read/writes. I'm not sure if the address alignment should be divisible

by the cpu word size (4 bytes on 32 bits machine and 8 bytes on 64 bits machine)
by the sizeof(T)
by the alignof(T)

I read different answers different places.
For example in this question the accepted answers says:

The usual rule of thumb (straight from Intels and AMD's optimization manuals) is that every data type should be aligned by its own size. An int32 should be aligned on a 32-bit boundary, an int64 on a 64-bit boundary, and so on. A char will fit just fine anywhere.

So by that answer it looks like the address alignment should be divisible by sizeof(T).

On this question the second answer state that:

The CPU always reads at its word size (4 bytes on a 32-bit processor), so when you do an unaligned address access — on a processor that supports it — the processor is going to read multiple words.

So by that answer it looks like the address alignment should be divisble by the cpu word size.

So I'm seeing some conflicted statements on how to optimize data alignment for cpu read/write and I'm not sure if I'm not understanding something correctly or there're some wrong answers? Maybe someone could clear this out for me on what the address alignment should be divisible by.

It depends on what you intend to use the RAM for, if you're allocating large blocks it might be better to page align. There is no one size fits all solution, which is why the compiler pads structs out depending on types at compile time unless told otherwise. If you want to cover yourself in both 32 and 64 bit instances, just align to 64 bit at the cost of some wasted RAM. — Geoffrey, May 07 '20 at 00:57
@Geoffrey So by your answer I can assume that I should align the data on cpu word size and not `sizeof(T)`? — Jorayen, May 07 '20 at 01:05
`sizeof(T)` might be 3 bytes, it depends on the type, so yes, align on CPU word size. Again though, if you are dealing with copies of large blocks of data, alignment to pagesize would be better. — Geoffrey, May 07 '20 at 01:06
@Geoffrey no just normal object data. Also alignment is parameter I pass to the allocator so I could just instantiate it instead this way: `m_Alignment = sizeof(T*)` can't I? And if I align on either 4 bytes or 8 bytes (depending on cpu word size) won't there be always 0 padding if the first address is divisible by 4/8 ? And if so what if I can guarantee that the starting address is divisible by 4/8 won't it be kind of optimization to potentially ensure more objects in the pool? If so anyway I can guarantee such address division with malloc? — Jorayen, May 07 '20 at 01:11

score 3 · Answer 1 · edited May 07 '20 at 16:29

As a general rule-of-thumb (that is, do this unless you have a good reason to do otherwise), you want to align elements of a given C++ type to their alignment, i.e., alignof(T). If the type wants to be aligned to a 32-bit boundary (as int in most common c++ implementation is implemented), it will exhibit a fitting (4-byte) alignment.

Of course, between the base addresses of two different objects of type T there must be at least sizeof(T) bytes of room, which will usually be an integer-valued multiple of its alignment (it is actually quite hard to pass an over-aligned type to a template function, as it will strip any outer alignas attribute).

In most use-cases, you will therefore be fine by doing the following: Find the first base-address in your underlying storage that is aligned to alignof(T) and then go forward from there in steps of sizeof(T).

This way, you will rely on the users of your allocator to tell you what they want. This is exactly what you want, as the optimizer may rely on knowledge about alignment and, e.g., emit SSE aligned loads for arrays of double-precision floats, which will cause your program to crash if they are aligned wrongly.

Going down the rabbit hole

This gives rise to the following possible situations:

Easy type, has word length and word alignment (e.g., int with sizeof(int) = 4 and alignof(int) = 4):

sizeof(T) = 4 and alignof(T) = 4
 0  1  2  3  4  5  6  7  8  9  A  B  C  D  E  F 
[aaaaaaaaaa][bbbbbbbbbb][cccccccccc][dddddddddd]

Types which have a size that is a multiple of its alignment (e.g., using T = int[2])

sizeof(T) = 8 and alignof(T) = 4
 0  1  2  3  4  5  6  7  8  9  A  B  C  D  E  F 
[aaaaaaaaaaaaaaaaaaaaaa][bbbbbbbbbbbbbbbbbbbbbb]

Overaligned types, which have a larger alignment than size (e.g., using T = alignas(8) char[3]). Here be dragons!

sizeof(T) = 3 and alignof(T) = 8
 0  1  2  3  4  5  6  7  8  9  A  B  C  D  E  F 
[aaaaaaa]               [bbbbbbb]

Note that there is unused space in the overaligned example. This is necessary, as objects that are aligned to an 8-byte boundary may not be put anywhere else, leading to potential wastage. The most common use for types such as this is CPU-specific optmizations, such as to prevent false sharing.

Finally, there is the slightly weird case of objects whose size is larger than, but not an integer multiple of their alignment (e.g., using T = alignas(4) char[5];). This is basically just a small extension to the previous example of overaligned types:

sizeof(T) = 5 and alignof(T) = 4
 0  1  2  3  4  5  6  7  8  9  A  B  C  D  E  F 
[aaaaaaaaaaaaa]         [bbbbbbbbbbbbb]

While the alignment would make it possible to place the second object at base address 4, there is already an object there.

Putting all these examples together, the number of bytes that needs to be between the base addresses of two objects of type T is:

inline auto object_distance = sizeof(T) % alignof(T) == 0 ? sizeof(T) : sizeof(T) + (alignof(T) - sizeof(T) % alignof(T));

isn't it `auto object_distance = (sizeof(T) + alignof(T) - 1) / alignof(T) * alignof(T);` ? (which is effectively `ceil(sizeof(T), alignof(T))`? — Mooing Duck, May 07 '20 at 16:30
Right now I calculate the padding with `-((int)sizeof(T)) & (alignment - 1)` which works for all the cases you've shown. The question is what the value of `alignment` should be. Right now I use `alignment = sizeof(T*)` to alternate between 4 bytes / 8 bytes depending on the cpu word size. Am I correct for doing this? — Jorayen, May 07 '20 at 16:47
@MooingDuck Both expressions will yield the same result (if I did not make a mistake somewhere). — danielschemmel, May 07 '20 at 18:00
@Jorayen For alignments that are power of two, `x & (alignment - 1)` is equivalent to `x % alignment`, which should help you see how your expression is related to the one I presented. The alignment of `T` should always be `alignof(T)`. — danielschemmel, May 07 '20 at 18:02
@gha.st: Oh, I assumed your math was right. I just provided a variant without a ternary. Ternaries are hard to read. Actually, looking again, you have `sizeof(T) + (alignof(T) - sizeof(T) % alignof(T)`, which is simpler than mine even. — Mooing Duck, May 07 '20 at 18:04
@gha.st yea I understand the alignment of `T` should always be `aligof(T)` what I'm not sure of (maybe you've stated it in your answer but I couldn't understand is the alignment of a memory block allocated with `malloc` for example should be cpu word size? Say we have a memory pool, do we give to users address that aligns with the cpu word size? For example `Allocator.Allocate(20); // asking 20 bytes` the address return should be divisible by the size of pointer (meaning 4 bytes on x32 and 8 bytes on x64) am I correct? — Jorayen, May 07 '20 at 20:17

Nicol Bolas · Answer 2 · 2020-05-07T16:29:15.053

1

After I allocate a block of memory I'm wondering how much padding do I need between each object in the block to optimize cpu read/writes.

Precisely zero padding between objects; you're not allowed to add padding. In the C++ standard library allocator model, your allocator<T>::allocate(count) method is required to allocate sufficient space to store an array of count objects of type T. Arrays in C++ are tightly packed; the offset from one T in the array to another T is required to be sizeof(T).

So you can't insert padding between objects in the allocated storage. You can insert padding at the beginning of the block of memory you allocate, so that you can be accurate with alignof(T) (which your allocator<T>::allocate is also required to respect). But the returned pointer has to be a pointer to the aligned storage for the Ts. So if you have padding in the front of the allocation, you'll need some way to undo the padding when deallocate is called, since it only gets the aligned storage address.

When it comes to alignment of structs that contain fundamental types, you're relying on the compiler to impose its alignment requirements on those structs. So for this definition:

struct U
{
  std::int32_t i;
  std::int64_t j;
};

If the compiler deems that it would be more optimal for int64_t's to be on 8-byte alignments, then the compiler will insert appropriate padding between i and j in U. sizeof(U) will be 16, and alignof(U) will be 8.

Creating that alignment is not your job, and you're not allowed to do it for the compiler. You must simply respect the alignment of any type you're given in your allocator<T>::allocate calls.

edited May 07 '20 at 16:29

answered May 07 '20 at 16:22

Nicol Bolas

449,505
63
781
982

My ultimate intention is to optimize spatial locality, and that's why I try to create different allocators for cases such as: `std::vector>` where all the pointers are linearly stored but all the `*A`s are scattered across memory, what should I do then if one of the requirements of allocator model is to allocate tightly packed memory for an array of `T` and I can't add padding to allow for faster read/writes for the cpu? – Jorayen May 07 '20 at 16:56
Maybe what I should do is create these custom allocators that insert padding for easier read and writes for the cpu but not make them follow the standard allocator model and work with stl containers? And then I could just supply memory from it to `unique_ptr` with custom `deleter` function? – Jorayen May 07 '20 at 16:58
@Jorayen: "*why I try to create different allocators for cases such as: `std::vector>` where all the pointers are linearly stored but all the *As are scattered across memory*" An allocator can't fix that problem. That is the domain of a container. – Nicol Bolas May 07 '20 at 17:13
So when would you want to use a custom memory allocator such as Pool/Linear allocator? – Jorayen May 07 '20 at 17:17
@Jorayen: A "Pool/Linear allocator" for what container? The basic memory allocator wouldn't change the scattered nature of a vector of pointers to objects. You can allocate those `A`s from a pool, but that's none of `vector`'s business, so it's none of `vector`'s allocator's business. – Nicol Bolas May 07 '20 at 17:18
Yea I've understood this and refereed to this in my second comment *"Maybe what I should do is create these custom allocators that insert padding for easier read and writes for the cpu but not make them follow the standard allocator model and work with stl containers? And then I could just supply memory from it to unique_ptr with custom deleter function? "* Still waiting for an answer if that's the correct approach for what I'm looking for – Jorayen May 07 '20 at 17:21
@Jorayen: I don't understand where the "easier read and writes for the CPU" comes from. That would require that these `A`s were allocated misaligned. C++ requires that you construct `A`s in storage aligned in accord with `alignof(A)`. If you don't, you're in UB land. – Nicol Bolas May 07 '20 at 17:22
Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/213356/discussion-between-jorayen-and-nicol-bolas). – Jorayen May 07 '20 at 17:30

Custom allocator and memory alignment

2 Answers2

Going down the rabbit hole