How and when to align to cache line size?

Question

Dmitry Vyukov's excellent bounded MPMC queue, written in C++ (see http://www.1024cores.net/home/lock-free-algorithms/queues/bounded-mpmc-queue), adds some padding variables.

I presume this is to align the fields to cache lines for performance.

I have some questions.

Why is it done in this way?
Is it a portable method that will always work?
In what cases would it be best to use __attribute__ ((aligned (64))) instead?

Why would padding before a buffer pointer help with performance? Isn't only the pointer loaded into the cache, so that it's really only the size of a pointer?

static size_t const     cacheline_size = 64;
typedef char            cacheline_pad_t [cacheline_size];

cacheline_pad_t         pad0_;
cell_t* const           buffer_;
size_t const            buffer_mask_;
cacheline_pad_t         pad1_;
std::atomic<size_t>     enqueue_pos_;
cacheline_pad_t         pad2_;
std::atomic<size_t>     dequeue_pos_;
cacheline_pad_t         pad3_;

Would this concept work under GCC for C code?

why use the cacheline_size as pad size, rather than ( cacheline_size - sizeof( std::atomic ) ) ? — Y00, May 06 '20 at 17:38

Phil Miller · Accepted Answer · 2014-02-18T22:59:56.747

55

It's done this way so that different cores modifying different fields won't have to bounce the cache line containing both of them between their caches. In general, for a processor to access some data in memory, the entire cache line containing it must be in that processor's local cache. If it's modifying that data, that cache entry usually must be the only copy in any cache in the system (Exclusive mode in the MESI/MOESI-style cache coherence protocols). When separate cores try to modify different data that happens to live on the same cache line, and thus waste time moving that whole line back and forth, that's known as false sharing.
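To make the definition concrete, here is a minimal sketch of a layout that invites false sharing. It assumes a 64-byte line; `PackedCounters`, `cache_line_of` and `same_cache_line` are hypothetical names used only for illustration, not part of the queue's code.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumed line size; 64 bytes is common on x86-64 but not universal.
constexpr std::size_t kCacheLine = 64;

// Two counters written by different threads, packed next to each other.
// The struct itself starts on a line boundary, so both counters share a line.
struct alignas(kCacheLine) PackedCounters {
  std::atomic<std::size_t> enqueue_pos{0};
  std::atomic<std::size_t> dequeue_pos{0};
};

// Index of the cache line that holds the first byte of *p.
inline std::uintptr_t cache_line_of(const void* p) {
  return reinterpret_cast<std::uintptr_t>(p) / kCacheLine;
}

// True when both objects start on the same cache line.
inline bool same_cache_line(const void* a, const void* b) {
  return cache_line_of(a) == cache_line_of(b);
}
```

For a `PackedCounters c;`, `same_cache_line(&c.enqueue_pos, &c.dequeue_pos)` is true, so a producer writing one counter and a consumer writing the other would keep pulling the same line away from each other.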

In the particular example you give, one core can be enqueueing an entry (reading (shared) buffer_ and writing (exclusive) only enqueue_pos_) while another dequeues (shared buffer_ and exclusive dequeue_pos_) without either core stalling on a cache line owned by the other.

The padding at the beginning means that buffer_ and buffer_mask_ end up on the same cache line, rather than split across two lines and thus requiring double the memory traffic to access.

I'm unsure whether the technique is entirely portable. The assumption is that each cacheline_pad_t will itself be aligned to a 64 byte (its size) cache line boundary, and hence whatever follows it will be on the next cache line. So far as I know, the C and C++ language standards only require this of whole structures, so that they can live in arrays nicely, without violating alignment requirements of any of their members. (see comments)
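The point raised in the comments can be checked at compile time. The sketch below uses a simplified stand-in for the queue's members (`padded_fields` is a hypothetical name, and `void*` replaces `cell_t*`); it suggests that the pads are not themselves aligned, yet still keep the hot fields at least one full line apart.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

static const std::size_t cacheline_size = 64;
typedef char cacheline_pad_t[cacheline_size];

// Simplified stand-in for the queue's data members (hypothetical name).
struct padded_fields {
  cacheline_pad_t          pad0_;
  void*                    buffer_;
  std::size_t              buffer_mask_;
  cacheline_pad_t          pad1_;
  std::atomic<std::size_t> enqueue_pos_;
  cacheline_pad_t          pad2_;
  std::atomic<std::size_t> dequeue_pos_;
  cacheline_pad_t          pad3_;
};

// A char array only needs 1-byte alignment, so the pads are not line-aligned.
static_assert(alignof(cacheline_pad_t) == 1, "char arrays have alignment 1");

// Fields that start at least 64 bytes apart cannot start on the same line.
static_assert(offsetof(padded_fields, dequeue_pos_) -
                  offsetof(padded_fields, enqueue_pos_) >= cacheline_size,
              "enqueue_pos_ and dequeue_pos_ are on different lines");
static_assert(offsetof(padded_fields, enqueue_pos_) -
                  offsetof(padded_fields, buffer_mask_) >= cacheline_size,
              "read-mostly fields are separated from enqueue_pos_");
```

What the padding alone does not guarantee is that `buffer_` and `buffer_mask_` land on one line; that depends on where the enclosing object starts.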

The attribute approach would be more compiler specific, but might cut the size of this structure in half, since the padding would be limited to rounding up each element to a full cache line. That could be quite beneficial if one had a lot of these.
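As a rough sketch of the alignment-based alternative, the block below uses the C++11 `alignas` specifier on the members; with GCC the same request can be spelled `__attribute__ ((aligned (64)))`. The struct name `aligned_fields` and the use of `void*` in place of `cell_t*` are assumptions made for illustration.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

constexpr std::size_t kCacheLine = 64;  // assumed line size

// Hypothetical variant of the queue's fields using per-member alignment.
struct aligned_fields {
  alignas(kCacheLine) void*                    buffer_;
  std::size_t                                  buffer_mask_;
  alignas(kCacheLine) std::atomic<std::size_t> enqueue_pos_;
  alignas(kCacheLine) std::atomic<std::size_t> dequeue_pos_;
};

// The strictest member alignment becomes the alignment of the whole struct.
static_assert(alignof(aligned_fields) == kCacheLine, "struct is line-aligned");

// Each hot field starts its own line; the compiler inserts the gaps.
static_assert(offsetof(aligned_fields, enqueue_pos_) == kCacheLine, "");
static_assert(offsetof(aligned_fields, dequeue_pos_) == 2 * kCacheLine, "");

// Three lines in total, including tail padding after dequeue_pos_.
static_assert(sizeof(aligned_fields) == 3 * kCacheLine, "");
```

On a typical 64-bit ABI the padded layout from the question comes to 288 bytes, against 192 here. One caveat: objects of an over-aligned type created with plain `new` are only guaranteed to honor the alignment from C++17 onward.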

The same concept applies in C as well as C++.

edited Feb 18 '14 at 22:59

answered Dec 12 '11 at 03:23

Phil Miller


@Novelcrat - Ok that makes a lot of sense. So what about questions 2 & 3? – hookenz Dec 12 '11 at 03:43
13

@MattH: For portability, C++11 introduces `std::aligned_storage`, which allows you to request storage of a defined size and alignment. The default alignment for a `char [N]` is `1` otherwise. – Matthieu M. Dec 12 '11 at 07:32
1

Why would the linker not optimize the padding variables out if they are not used? – RishiD Aug 26 '13 at 17:37
14

Actually, there is no assumption that "cacheline_pad_t will itself be aligned to a 64 byte;" alignment is actually not required. The padding just guarantees the only goal, namely that the variables before and after are in **different** cache lines. – hrr Feb 17 '14 at 11:22
4

And the more modern C++11 standard has `alignas` declaration modifiers to do this portably. This is supported on just about every actively-developed C++ compiler. – Phil Miller Feb 26 '16 at 16:35
@RishiD c++ object model prevents compilers from dropping unused member variable(s). – HCSF Sep 20 '18 at 13:49
1

@hrr your comment is interesting. So if `mpmc_bounded_queue` has `alignas(64)` in its declaration, it seems like `cacheline_pad_t pad0_` isn't needed as `buffer_` will be aligned to the cache line (assuming cache line size is 64)? In that case, the paddings there can be more "compact" without running into false sharing? Thanks in advance! – HCSF Sep 20 '18 at 14:00
@HCSF Actually, you would need to `alignas` the relevant member variables, because you want them on separate cache lines; doing it on the whole `mpmc_bounded_queue` would not work. It would indeed save an (insignificant) number of bytes in the given example, but would not affect performance. – hrr Oct 24 '18 at 14:11
@hrr I think I see what you mean now -- when `alignas` is applied on `mpmc_bounded_queue`, its member won't be cache line aligned but `mpmc_bounded_queue` is; hence, `buffer_` won't be aligned, right? – HCSF Oct 25 '18 at 11:26
1

why not use ( cacheline_size - sizeof( std::atomic ) ) as pad size? this will be more compact – Y00 May 07 '20 at 08:55

score 7 · Answer 2 · answered Aug 22 '18 at 17:28

You may need to align to a cache line boundary, which is typically 64 bytes, when you are working with interrupts or high-performance data reads, and alignment is mandatory when working with interprocess sockets. With interprocess sockets, there are control variables that cannot be spread out across multiple cache lines or DDR RAM words, or else it will cause the L1, L2, etc. caches or DDR RAM to function as a low-pass filter and filter out your interrupt data! THAT IS BAD!!! That means you get bizarre errors when your algorithm is good, and it has the potential to make you go insane!

The DDR RAM is almost always going to read in 128-bit words (DDR RAM words), which is 16 bytes, so the ring buffer variables shall not be spread out across multiple DDR RAM words. Some systems do use 64-bit DDR RAM words, and technically you could get a 32-bit DDR RAM word on a 16-bit CPU, but one would use SDRAM in that situation.

One may also just be interested in minimizing the number of cache lines in use when reading data in a high-performance algorithm. In my case, I developed the world's fastest integer-to-string algorithm (40% faster than the prior fastest algorithm) and I'm working on optimizing the Grisu algorithm, which is the world's fastest floating-point-to-string algorithm. In order to print the floating-point number you must print the integer, so one optimization I have implemented for Grisu is to cache-line-align its lookup tables (LUTs) into exactly 15 cache lines; it is rather odd that they aligned exactly like that. This takes the LUTs from the .bss section (i.e. static memory) and places them onto the stack (or heap, but the stack is more appropriate). I have not benchmarked this, but one thing worth bringing up, and I learned a lot about this, is that the fastest way to load values is to load them from the i-cache and not the d-cache. The difference is that the i-cache is read-only and, because it is read-only, has much larger cache lines (2KB was what a professor quoted me once). So you're actually going to degrade your performance by array indexing, as opposed to loading a variable like this:

int faster_way = 12345678;

as opposed to the slower way:

int variables[2] = { 12345678, 123456789};
int slower_way = variables[0];

The difference is that int faster_way = 12345678 will get loaded from the i-cache lines by offsetting to the variable in the i-cache from the start of the function, while slower_way = variables[0] will get loaded from the smaller d-cache lines using much slower array indexing. This particular subtlety, as I just discovered, is actually slowing down my and many others' integer-to-string algorithms. I say this because you may think you're optimizing by cache-aligning read-only data when you're not.

Typically in C++, you will use the std::align function. I would advise not using this function because it is not guaranteed to work optimally. Here is the fastest way to align to a cache line; to be up front, I am the author and this is a shameless plug:

Kabuki Toolkit Memory Alignment Algorithm

#include <cstdint>

namespace _ {
/* Aligns the given pointer up to a power-of-two boundary with a premade mask.
@return An aligned pointer of typename T.
@brief Algorithm is a 2's complement trick that works by masking off
the desired number of bits of the negated address and adding them to the
pointer.
@param pointer The pointer to align.
@param mask The mask for the least significant bits to align, i.e. the
alignment minus one (63 for a 64-byte cache line). */
template <typename T = char>
inline T* AlignUp(void* pointer, uintptr_t mask) {
  // Unsigned arithmetic keeps the negation and the addition well defined.
  uintptr_t value = reinterpret_cast<uintptr_t>(pointer);
  value += (0 - value) & mask;
  return reinterpret_cast<T*>(value);
}
} //< namespace _

// Example calls using the faster mask technique.

enum { kSize = 256 };
char buffer[kSize + 64];

// AlignUp lives in namespace _, so the calls must be qualified.
char* aligned_to_64_byte_cache_line = _::AlignUp<>(buffer, 63);

char16_t* aligned_to_64_byte_cache_line2 = _::AlignUp<char16_t>(buffer, 63);
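The mask arithmetic can be checked on plain numbers. This is a small sketch, not part of the toolkit: `align_up_offset` is a hypothetical helper that isolates the `(-value) & mask` step, using unsigned integers so that the negation is ordinary modular arithmetic.

```cpp
#include <cassert>
#include <cstdint>

// Bytes to add to `address` to reach the next multiple of (mask + 1).
// `mask` must be a power of two minus one, e.g. 63 for 64-byte lines.
constexpr std::uintptr_t align_up_offset(std::uintptr_t address,
                                         std::uintptr_t mask) {
  return (0 - address) & mask;  // unsigned negation wraps modulo 2^N
}

// 0x1003 is 3 bytes past a 64-byte boundary, so 61 more bytes are needed.
static_assert(align_up_offset(0x1003, 63) == 61, "64 - 3 == 61");
static_assert(0x1003 + align_up_offset(0x1003, 63) == 0x1040, "next boundary");

// An address that is already aligned needs no adjustment.
static_assert(align_up_offset(0x1040, 63) == 0, "already aligned");
```

Because the offset is at most `mask`, a buffer that is over-allocated by the alignment (as with `kSize + 64` above) always has room for the aligned block.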

and here is the faster std::align replacement:

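#include <cstddef>
#include <cstdint>

inline void* align_kabuki(size_t align, size_t size, void*& ptr,
                          size_t& space) noexcept {
  // Begin Kabuki Toolkit Implementation
  uintptr_t int_ptr = reinterpret_cast<uintptr_t>(ptr);
  // Bytes needed to reach the next multiple of align (a power of two).
  size_t offset = static_cast<size_t>((0 - int_ptr) & (align - 1));
  // Check before subtracting so the unsigned arithmetic cannot wrap around.
  if (space < size || space - size < offset) return nullptr;
  // Like std::align, update ptr and space on success.
  space -= offset;
  ptr = reinterpret_cast<void*>(int_ptr + offset);
  return ptr;
  // End Kabuki Toolkit Implementation
}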
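For comparison, this is roughly how the standard `std::align` from `<memory>` would be used for the same job. It is only a sketch: `carve_cache_line_block` is a hypothetical wrapper, and the 64-byte line size is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>

// Returns a 64-byte-aligned pointer to `size` bytes inside
// [buffer, buffer + space), or nullptr if such a block does not fit.
inline void* carve_cache_line_block(void* buffer, std::size_t space,
                                    std::size_t size) {
  void* p = buffer;
  // std::align adjusts p and space in place and returns the aligned pointer,
  // or nullptr (leaving both unchanged) when the block does not fit.
  return std::align(64, size, p, space);
}
```

With `char buffer[256 + 64];`, calling `carve_cache_line_block(buffer, sizeof(buffer), 256)` should always succeed, because at most 63 bytes are skipped to reach the boundary.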

Could you elaborate on the statement "it will cause the L1, L2, etc or caches or DDR RAM to function as a low-pass filter", or post a link to an explanation? I struggle to understand, how the frequency of the signal plays into this — zomnombom, Oct 28 '20 at 17:53
