
There are many great threads on how to align structs to the cache line (e.g., Aligning to cache line and knowing the cache line size).

Imagine you have a system with 256B cache line size, and a struct of size 17B (e.g., a tightly packed struct with two uint64_t and one uint8_t). If you align the struct to cache line size, you will have exactly one cache line load per struct instance.

For machines with a cache line size of 32B or maybe even 64B, this will be good for performance, because we avoid having to fetch 2 cache lines, as we definitely do not cross CL boundaries.

However, on the 256B machine, this wastes lots of memory and results in unnecessary loads when iterating through an array/vector of this struct. In fact, you could store 15 instances of the struct in a single cacheline.

My question is two-fold:

  1. In C++17 and above, using alignas, I can align to cache line size. However, it is unclear to me how I can force alignment in a way that is similar to "put as many instances in a cache line as possible without crossing the cache line boundary, then start at the next cache line". So something like this:

[figure: a diagram of struct instances packed into a cache line, with padding at the end]

where the upper box is a cache line and the other boxes are instances of our small struct.

  2. Do I actually want this? I cannot really wrap my head around this. Usually, we say if we align our struct to the cache line size, access will be faster, as we just have to load a single cache line. However, seeing my example, I wonder if this is actually true. Wouldn't it be faster to not be aligned, but instead store many more instances in a single cache line?

Thank you so much for your input here. It is much appreciated.

Maxbit
  • Normally I don't worry about this. If I'm using multithreading, then I might depending on the work I'm doing. Multiple objects in the same cache line will suffer from [false sharing](https://web.archive.org/web/20200302130554/https://software.intel.com/en-us/articles/avoiding-and-identifying-false-sharing-among-threads) if multiple threads are modifying those objects separately. – NathanOliver Mar 11 '22 at 14:57
  • I'm trying to get the last squeeze of performance out of my program, which is why I am at the point of optimizing cache alignment. Usually, I would not worry about that as well. – Maxbit Mar 11 '22 at 15:00
  • You might get more performance by making you class a size that is an even divisor of the cache line size. That will guarantee that you don't have an object that is living in two different lines. – NathanOliver Mar 11 '22 at 15:03

2 Answers


To address (2), it is unclear whether the extra overhead of using packed structs (e.g., unaligned 64-bit accesses) and the extra math to access array elements will be worth it. But if you want to try it, you can create a new struct to pack your struct elements appropriately, then create a small wrapper class to access the elements like you would an array:

    #include <array>
    #include <cstddef>
    #include <cstdint>
    #include <iostream>
    
    using namespace std;
    
    template <typename T, size_t BlockAlignment>
    struct __attribute__((packed)) Packer
    {
        static constexpr size_t NUM_ELEMS = BlockAlignment / sizeof(T);
        static_assert( NUM_ELEMS > 0, "BlockAlignment too small for one object." );
        T &operator[]( size_t index ) { return packed[index]; }
        
        T packed[ NUM_ELEMS ];
        uint8_t padding[ BlockAlignment - sizeof(T)*NUM_ELEMS ];
    };
    
    template <typename T, size_t NumElements, size_t BlockAlignment>
    struct alignas(BlockAlignment) PackedAlignedArray
    {
        typedef Packer<T, BlockAlignment> PackerType;
        // Round up so a partial last block is included, without
        // over-allocating when NumElements divides evenly.
        std::array< PackerType, (NumElements + PackerType::NUM_ELEMS - 1) / PackerType::NUM_ELEMS > packers;
        T &operator[]( size_t index ) { 
            return packers[ index / PackerType::NUM_ELEMS ][ index % PackerType::NUM_ELEMS ];
        }
    };
    
    struct __attribute__((packed)) Foo
    {
        uint64_t a;
        uint64_t b;
        uint8_t c;
    };
    
    int main()
    {
        static_assert( sizeof(Foo) == 17, "Struct not packed for test" );
        constexpr size_t NUM_ELEMENTS = 10;
        constexpr size_t BLOCK_ALIGNMENT = 64;

        PackedAlignedArray<Foo, NUM_ELEMENTS, BLOCK_ALIGNMENT> theArray;

        for ( size_t i=0; i<NUM_ELEMENTS; ++i )
        {
            // Display the memory offset between the current
            // element and the start of the array
        cout << reinterpret_cast<std::uintptr_t>(&theArray[i]) - 
            reinterpret_cast<std::uintptr_t>(&theArray[0]) << std::endl;
        }
    
        return 0;
    }

The output of the program shows the byte offsets in memory of the 17-byte elements, resetting to the next multiple of 64 after every three elements:

0
17
34
64
81
98
128
145
162
192

You could pack into a cache line by declaring the struct itself as under-aligned, with GNU C __attribute__((packed)) or something, so it has sizeof(struct) = 17 instead of the usual padding you'd get to make the struct size a multiple of 8 bytes (the alignment it would normally have because of having a uint64_t member, assuming alignof(uint64_t) == 8).

Then put it in an alignas(256) T array[], so only the start of the array is aligned.

Alignment to boundaries wider than the struct object itself is only possible in terms of a larger object containing multiple structs; ISO C++'s alignment system can't specify that an object can only go in containers which start at a certain alignment boundary. That would lead to nonsensical situations like T *arr being the start of a valid array, but arr + 1 not being a valid array, even though it has the same type.


Unless you're worried about false sharing between two threads, you should at most naturally-align your struct, e.g. to 32 bytes. A naturally-aligned object (alignof=sizeof) can't span an alignment boundary larger than itself. But that wastes a lot of space, so probably better to just let the compiler align it by 8, inheriting alignof(struct) = max (alignof(members)).

See also How do I organize members in a struct to waste the least space on alignment? for more detail on how space inside a single struct works.


Depending on how your data is accessed, one option would be to store pairs of uint64_t in one array, and the corresponding byte in a separate array. That way you have no padding bytes and everything is a power of 2 size and alignment. But random access to all 3 members that go with each other could cost 2 cache misses instead of 1.

But for iterating over your data sequentially, that's excellent.

Peter Cordes
  • Thank you very much for your insightful answer. What exactly do you mean by "naturally-align"? In the first sentence, you state naturally-align = 32 bytes, and in the following sentence, alignof=sizeof. If I get you correctly, you would _not_ recommend tightly packing the struct (`__attribute__((packed))`) and then aligning that tightly packed struct with alignof=sizeof? – Maxbit Mar 11 '22 at 16:10
  • @Maxbit: I defined the term "naturally aligned" in the same paragraph: alignof(T) == sizeof(T). If that's not clear, google it; it's a standard thing. If the size was already a power of 2, that would be `alignas(sizeof(T)) T foo;` Otherwise you'd need to manually round up to the next power of 2. In your case, yes, that's 32 bytes. – Peter Cordes Mar 11 '22 at 16:13
  • @Maxbit: If you actually had a machine with 256-byte cache lines, then packing like this would increase density significantly. And would be a good thing *if* your CPU was like modern x86 and supported fully efficient unaligned loads/stores as long as they don't cross a cache-line boundary. Otherwise it might be a tradeoff between data density vs. efficient loading of the uint64_t members. And using `__attribute__((packed))` is non-standard, and could make it more complicated to use efficiently in other cases. – Peter Cordes Mar 11 '22 at 16:17
  • @Maxbit: And it would only be fully efficient for an array of 15. For larger sizes, the 16th object would span a cache-line boundary. Unless you had nested structs of structs like `struct cacheline{ alignas(256) T arr[15]; }` which would result in one padding byte per `struct cacheline`, so you could have an array of that without cache-line splits. But still misalignment within cache lines. – Peter Cordes Mar 11 '22 at 16:18
  • I am still not sure on the difference between natural aligning the struct (`alignof(struct) = next_power_of_two(sizeof(struct))`) and `alignof(struct) = max (alignof(members))`. On the one hand, in your answer, you wrote that natural alignment wastes lots of space (I guess it depends on the next power of two), on the other hand, I don't understand why aligning it to the maximum member alignment would be more efficient (or would it not be)? I am very sorry for the many questions, I am trying to understand the different approaches you describe here (natural align vs max align vs auto pad/align) – Maxbit Mar 11 '22 at 17:04
  • @Maxbit: It should be obvious that storing 17 useful bytes in sizeof(T) == 24 (7 padding bytes to make it a multiple of alignof(T) == 8) packs your data more densely, wasting fewer bytes than in sizeof(T) == 32. – Peter Cordes Mar 11 '22 at 17:09
  • This is indeed obvious, sorry. But is there _any_ reason (or, other use case) in which natural alignment should be preferred over `alignof(struct) = max (alignof(members))`. It's only going to be larger (or equal), and hence will always waste at least as many bytes, wouldn't it? – Maxbit Mar 11 '22 at 17:14
  • 1
    @Maxbit: yes, but it means the whole struct will always be in one cache line (and page). I thought that was the point of why you were over-aligning your struct at all, in case of random access by something that loads all 3 members. Of course that advantage might be outweighed on average by a higher cache hit rate from better density. It could also be relevant for avoiding or reducing false-sharing between threads, if different threads access nearby structs. – Peter Cordes Mar 11 '22 at 17:25
  • Thank you so much for your help! I will run a couple of benchmarks to see the trade off between data density & misaligned access. Maybe that will help to resolve some confusion on my side. – Maxbit Mar 11 '22 at 20:37
  • Well, it has been some time, but I have considered this again and came up with follow up confusion on real world benchmarks here: https://stackoverflow.com/questions/72475623/unaligned-access-performance-on-intel-x86-vs-amd-x86-cpus – Maxbit Jun 02 '22 at 11:37