
There are many great threads on how to align structs to the cache line (e.g., Aligning to cache line and knowing the cache line size).

Imagine you have a system with 256B cache line size, and a struct of size 17B (e.g., a tightly packed struct with two uint64_t and one uint8_t). If you align the struct to cache line size, you will have exactly one cache line load per struct instance.

For machines with a cache line size of 32B or maybe even 64B, this will be good for performance, because we avoid having to fetch 2 cache lines, as we definitely do not cross CL boundaries.

However, on the 256B machine, this wastes lots of memory and results in unnecessary loads when iterating through an array/vector of this struct. In fact, you could store 15 instances of the struct in a single cacheline.

My question is two-fold:

  1. In C++17 and above, using alignas, I can align to cache line size. However, it is unclear to me how I can force alignment in a way that is similar to "put as many instances in a cache line as possible without crossing the cache line boundary, then start at the next cache line". So something like this:

[figure: a diagram of struct instances packed into a cache line, with padding at the end]

where the upper box is a cache line and the other boxes are instances of our small struct.

  2. Do I actually want this? I cannot really wrap my head around this. Usually, we say if we align our struct to the cache line size, access will be faster, as we just have to load a single cache line. However, seeing my example, I wonder if this is actually true. Wouldn't it be faster to not be aligned, but instead store many more instances in a single cache line?

Thank you so much for your input here. It is much appreciated.

Maxbit
  • Normally I don't worry about this. If I'm using multithreading, then I might depending on the work I'm doing. Multiple objects in the same cache line will suffer from [false sharing](https://web.archive.org/web/20200302130554/https://software.intel.com/en-us/articles/avoiding-and-identifying-false-sharing-among-threads) if multiple threads are modifying those objects separately. – NathanOliver Mar 11 '22 at 14:57
  • I'm trying to get the last squeeze of performance out of my program, which is why I am at the point of optimizing cache alignment. Usually, I would not worry about that as well. – Maxbit Mar 11 '22 at 15:00
  • You might get more performance by making you class a size that is an even divisor of the cache line size. That will guarantee that you don't have an object that is living in two different lines. – NathanOliver Mar 11 '22 at 15:03

2 Answers


To address (2), it is unclear whether the extra overhead of using packed structs (e.g., unaligned 64-bit accesses) and the extra math to access array elements will be worth it. But if you want to try it, you can create a new struct to pack your struct elements appropriately, then create a small wrapper class to access the elements like you would an array:

    #include <array>
    #include <cstddef>
    #include <cstdint>
    #include <iostream>
    
    using namespace std;
    
    template <typename T, size_t BlockAlignment>
    struct __attribute__((packed)) Packer
    {
        static constexpr size_t NUM_ELEMS = BlockAlignment / sizeof(T);
        static_assert( NUM_ELEMS > 0, "BlockAlignment too small for one object." );
        T &operator[]( size_t index ) { return packed[index]; }
        
        T packed[ NUM_ELEMS ];
        uint8_t padding[ BlockAlignment - sizeof(T)*NUM_ELEMS ];
    };
    
    template <typename T, size_t NumElements, size_t BlockAlignment>
    struct alignas(BlockAlignment) PackedAlignedArray
    {
        typedef Packer<T, BlockAlignment> PackerType;
        // Round up so a partial last block is included, without
        // over-allocating when NumElements divides evenly.
        std::array< PackerType, (NumElements + PackerType::NUM_ELEMS - 1) / PackerType::NUM_ELEMS > packers;
        T &operator[]( size_t index ) { 
            return packers[ index / PackerType::NUM_ELEMS ][ index % PackerType::NUM_ELEMS ];
        }
    };
    
    struct __attribute__((packed)) Foo
    {
        uint64_t a;
        uint64_t b;
        uint8_t c;
    };
    
    int main()
    {
        static_assert( sizeof(Foo) == 17, "Struct not packed for test" );
        constexpr size_t NUM_ELEMENTS = 10;
        constexpr size_t BLOCK_ALIGNMENT = 64;

        PackedAlignedArray<Foo, NUM_ELEMENTS, BLOCK_ALIGNMENT> theArray;

        for ( size_t i=0; i<NUM_ELEMENTS; ++i )
        {
            // Display the memory offset between the current
            // element and the start of the array
        cout << reinterpret_cast<std::uintptr_t>(&theArray[i]) - 
            reinterpret_cast<std::uintptr_t>(&theArray[0]) << std::endl;
        }
    
        return 0;
    }

The output of the program shows the byte offsets in memory of the 17-byte elements, resetting to the next multiple of 64 after every three elements:

0
17
34
64
81
98
128
145
162
192

You could pack into a cache line by declaring the struct itself as under-aligned, with GNU C __attribute__((packed)) or something, so it has sizeof(struct) = 17 instead of the usual padding you'd get to make the struct size a multiple of 8 bytes (the alignment it would normally have because of having a uint64_t member, assuming alignof(uint64_t) == 8).

Then put it in an alignas(256) T array[], so only the start of the array is aligned.

Alignment to boundaries wider than the struct object itself is only possible in terms of a larger object containing multiple structs; ISO C++'s alignment system can't specify that an object can only go in containers which start at a certain alignment boundary. That would lead to nonsensical situations like T *arr being the start of a valid array, but arr + 1 not being a valid array, even though it has the same type.


Unless you're worried about false sharing between two threads, you should at most naturally-align your struct, e.g. to 32 bytes. A naturally-aligned object (alignof=sizeof) can't span an alignment boundary larger than itself. But that wastes a lot of space, so probably better to just let the compiler align it by 8, inheriting alignof(struct) = max (alignof(members)).

See also How do I organize members in a struct to waste the least space on alignment? for more detail on how space inside a single struct works.


Depending on how your data is accessed, one option would be to store pairs of uint64_t in one array, and the corresponding byte in a separate array. That way you have no padding bytes and everything is a power of 2 size and alignment. But random access to all 3 members that go with each other could cost 2 cache misses instead of 1.

But for iterating over your data sequentially, that's excellent.

Peter Cordes
  • Thank you very much for your insightful answer. What exactly do you mean by "naturally-align"? In the first sentence, you state naturally-align = 32 bytes, and in the following sentence, alignof=sizeof. If I get you correctly, you would _not_ recommend tightly packing the struct (`__attribute__((packed))`) and then aligning that tightly packed struct with alignof=sizeof? – Maxbit Mar 11 '22 at 16:10
  • @Maxbit: I defined the term "naturally aligned" in the same paragraph: alignof(T) == sizeof(T). If that's not clear, google it; it's a standard thing. If the size was already a power of 2, that would be `alignas(sizeof(T)) T foo;` Otherwise you'd need to manually round up to the next power of 2. In your case, yes, that's 32 bytes. – Peter Cordes Mar 11 '22 at 16:13
  • @Maxbit: If you actually had a machine with 256-byte cache lines, then packing like this would increase density significantly. And would be a good thing *if* your CPU was like modern x86 and supported fully efficient unaligned loads/stores as long as they don't cross a cache-line boundary. Otherwise it might be a tradeoff between data density vs. efficient loading of the uint64_t members. And using `__attribute__((packed))` is non-standard, and could make it more complicated to use efficiently in other cases. – Peter Cordes Mar 11 '22 at 16:17
  • @Maxbit: And it would only be fully efficient for an array of 15. For larger sizes, the 16th object would span a cache-line boundary. Unless you had nested structs of structs like `struct cacheline{ alignas(256) T arr[15]; }` which would result in one padding byte per `struct cacheline`, so you could have an array of that without cache-line splits. But still misalignment within cache lines. – Peter Cordes Mar 11 '22 at 16:18
  • I am still not sure on the difference between natural aligning the struct (`alignof(struct) = next_power_of_two(sizeof(struct))`) and `alignof(struct) = max (alignof(members))`. On the one hand, in your answer, you wrote that natural alignment wastes lots of space (I guess it depends on the next power of two), on the other hand, I don't understand why aligning it to the maximum member alignment would be more efficient (or would it not be)? I am very sorry for the many questions, I am trying to understand the different approaches you describe here (natural align vs max align vs auto pad/align) – Maxbit Mar 11 '22 at 17:04
  • @Maxbit: It should be obvious that storing 17 useful bytes in sizeof(T) == 24 (7 padding bytes to make it a multiple of alignof(T) == 8) packs your data more densely, wasting fewer bytes than in sizeof(T) == 32. – Peter Cordes Mar 11 '22 at 17:09
  • This is indeed obvious, sorry. But is there _any_ reason (or, other use case) in which natural alignment should be preferred over `alignof(struct) = max (alignof(members))`. It's only going to be larger (or equal), and hence will always waste at least as many bytes, wouldn't it? – Maxbit Mar 11 '22 at 17:14
  • 1
    @Maxbit: yes, but it means the whole struct will always be in one cache line (and page). I thought that was the point of why you were over-aligning your struct at all, in case of random access by something that loads all 3 members. Of course that advantage might be outweighed on average by a higher cache hit rate from better density. It could also be relevant for avoiding or reducing false-sharing between threads, if different threads access nearby structs. – Peter Cordes Mar 11 '22 at 17:25
  • Thank you so much for your help! I will run a couple of benchmarks to see the trade off between data density & misaligned access. Maybe that will help to resolve some confusion on my side. – Maxbit Mar 11 '22 at 20:37
  • Well, it has been some time, but I have considered this again and came up with follow up confusion on real world benchmarks here: https://stackoverflow.com/questions/72475623/unaligned-access-performance-on-intel-x86-vs-amd-x86-cpus – Maxbit Jun 02 '22 at 11:37