You may need to align to a cache line boundary, which is typically 64 bytes per cache line, when you are working with interrupts or high-performance data reads, and they are mandatory to use when working with interprocess sockets. With Interprocess sockets, there are control variables that cannot be spread out across multiple cache lines or DDR RAM words else it will cause the L1, L2, etc or caches or DDR RAM to function as a low-pass filter and filter out your interrupt data! THAT IS BAD!!! That means you get bizarre errors when your algorithm is good and it has the potential to make you go insane!
The DDR RAM is almost always going to read in 128-bit words (DDR RAM Words), which is 16 bytes, so the ring buffer variables shall not be spread out across multiple DDR RAM words. some systems do use 64-bit DDR RAM words and technically you could get a 32-bit DDR RAM word on a 16-bit CPU but one would use SDRAM in the situation.
One may also just be interested in minimizing the number of cache lines in use when reading data in a high-performance algorithm. In my case, I developed the world's fastest integer-to-string algorithm (40% faster than prior fastest algorithm) and I'm working on optimizing the Grisu algorithm, which is the world's fastest floating-point algorithm. In order to print the floating-point number you must print the integer, so in order optimize the Grisu one optimization I have implemented is I have cache-line-aligned the Lookup Tables (LUT) for Grisu into exactly 15 cache lines, which is rather odd that it actually aligned like that. This takes the LUTs from the .bss section (i.e. static memory) and places them onto the stack (or heap but the Stack is more appropriate). I have not benchmarked this but it's good to bring up, and I learned a lot about this, is the fastest way to load values is to load them from the i-cache and not the d-cache. The difference is that the i-cache is read-only and has much larger cache lines because it's read-only (2KB was what a professor quoted me once.). So you're actually going to degrigate your performance from array indexing as opposed to loading a variable like this:
int faster_way = 12345678;
as opposed to the slower way:
int variables[2] = { 12345678, 123456789};
int slower_way = variables[0];
The difference is that the int variable = 12345678
will get loaded from the i-cache lines by offsetting to the variable in the i-cache from the start of the function, while slower_way = int[0]
will get loaded from the smaller d-cache lines using much slower array indexing. This particular subtly as I just discovered is actually slowing down my and many others integer-to-string algorithm. I say this because you may thing you're optimizing by cache-aligning read-only data when you're not.
Typically in C++, you will use the std::align
function. I would advise not using this function because it is not guaranteed to work optimally. Here is the fastest way to align to a cache line, which to be up front I am the author and this is a shamless plug:
Kabuki Toolkit Memory Alignment Algorithm
namespace _ {
/* Aligns the given pointer to a power of two boundaries with a premade mask.
@return An aligned pointer of typename T.
@brief Algorithm is a 2's compliment trick that works by masking off
the desired number of bits in 2's compliment and adding them to the
pointer.
@param pointer The pointer to align.
@param mask The mask for the Least Significant bits to align. */
template <typename T = char>
inline T* AlignUp(void* pointer, intptr_t mask) {
intptr_t value = reinterpret_cast<intptr_t>(pointer);
value += (-value ) & mask;
return reinterpret_cast<T*>(value);
}
} //< namespace _
// Example calls using the faster mask technique.
enum { kSize = 256 };
char buffer[kSize + 64];
char* aligned_to_64_byte_cache_line = AlignUp<> (buffer, 63);
char16_t* aligned_to_64_byte_cache_line2 = AlignUp<char16_t> (buffer, 63);
and here is the faster std::align replacement:
inline void* align_kabuki(size_t align, size_t size, void*& ptr,
size_t& space) noexcept {
// Begin Kabuki Toolkit Implementation
intptr_t int_ptr = reinterpret_cast<intptr_t>(ptr),
offset = (-int_ptr) & (align - 1);
if ((space -= offset) < size) {
space += offset;
return nullptr;
}
return reinterpret_cast<void*>(int_ptr + offset);
// End Kabuki Toolkit Implementation
}