
I am working on a C++ function to allocate multiple buffers in memory. The buffers have to be N-byte aligned since the data they hold will be processed with various SIMD instruction sets (SSE, AVX, AVX-512, etc.).

In the Apple Core Audio Utility Classes online I found this piece of code:

void CABufferList::AllocateBuffers(UInt32 nBytes)
{
    if (nBytes <= GetNumBytes()) return;

    if (mABL.mNumberBuffers > 1) {
        // align successive buffers for Altivec and to take alternating
        // cache line hits by spacing them by odd multiples of 16
        nBytes = ((nBytes + 15) & ~15) | 16;
    }
    UInt32 memorySize = nBytes * mABL.mNumberBuffers;
    Byte *newMemory = new Byte[memorySize], *p = newMemory;
    memset(newMemory, 0, memorySize);   // get page faults now, not later

    AudioBuffer *buf = mABL.mBuffers;
    for (UInt32 i = mABL.mNumberBuffers; i--; ++buf) {
        if (buf->mData != NULL && buf->mDataByteSize > 0) {
            // preserve existing buffer contents
            memcpy(p, buf->mData, buf->mDataByteSize);
        }
        buf->mDataByteSize = nBytes;
        buf->mData = p;
        p += nBytes;
    }
    Byte *oldMemory = mBufferMemory;
    mBufferMemory = newMemory;
    mBufferCapacity = nBytes;
    delete[] oldMemory;
}

The code is pretty straightforward; however, there is one line that I just don't fully grasp:

nBytes = ((nBytes + 15) & ~15) | 16;

I understand it's aligning/quantizing the number of bytes to a multiple of 16; however, I don't understand why it's ORing in 16 at the end. The comment says: "to take alternating cache line hits by spacing them by odd multiples of 16". Excuse my thickness, but I still don't get it.
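
To see the arithmetic concretely, here is a small throwaway program (my own, not part of the Apple code) that prints a few sizes before and after the adjustment:

#include <cstdio>

int main() {
    for (unsigned n : {1u, 17u, 32u, 48u, 256u}) {
        unsigned rounded  = (n + 15) & ~15u;  // round up to the next multiple of 16
        unsigned adjusted = rounded | 16u;    // force bit 4, i.e. make it an odd multiple of 16
        std::printf("%3u -> %3u -> %3u\n", n, rounded, adjusted);
    }
}

It prints 1 -> 16 -> 16, 17 -> 32 -> 48, 32 -> 32 -> 48 and 256 -> 256 -> 272, so the result is always an odd multiple of 16, but I don't see what that buys.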

So I have three questions:

1) What does `| 16` do exactly, and why is it done?

2) Considering the context of memory allocation and data access, how and in what terms does `| 16` improve the code? From the comments in the code I can guess it is related to cache access, but I don't understand the whole "alternating cache line hits" bit. How does spacing the memory allocation addresses by odd multiples of 16 improve cache access?

3) Am I right in thinking that the above function will only work correctly based on the assumption that the new operator will return at least 16-byte aligned memory? In C++ the new operator is defined as returning a pointer to storage with alignment suitable for any object with a fundamental alignment requirement, which might not necessarily be 16 bytes.
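
For what it's worth, here is a quick way to check what a given platform actually guarantees (the macro requires C++17):

#include <cstddef>
#include <cstdio>

int main() {
    std::printf("alignof(std::max_align_t) = %zu\n", alignof(std::max_align_t));
#ifdef __STDCPP_DEFAULT_NEW_ALIGNMENT__
    std::printf("default operator new alignment = %zu\n",
                static_cast<std::size_t>(__STDCPP_DEFAULT_NEW_ALIGNMENT__));
#endif
}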

Luigi Castelli
  • https://en.cppreference.com/w/cpp/language/alignas – Jesper Juhl Feb 10 '20 at 12:31
  • @JesperJuhl: If/when `alignas` does anything for *dynamic* allocation, it's only with C++17. Earlier C++ revisions made it hard to get aligned memory on top of new/delete. – Peter Cordes Feb 10 '20 at 12:46
  • @PeterCordes Since no specific standard was specified, I assume the current one (C++17 ATM). I think that is reasonable. – Jesper Juhl Feb 10 '20 at 12:48
  • For question (1), `| 16` just makes nBytes an odd multiple of 16, as per the comment above this line in the code. – Paul R Feb 10 '20 at 12:49
  • @JesperJuhl: Sure, but does it always Just Work in C++17? Or would you need an over-aligned type for `new` if what you really wanted was an aligned buffer of `float`? – Peter Cordes Feb 10 '20 at 12:49
  • `| 16` unconditionally sets that bit in the binary integer, making sure it can't be a multiple of 32. – Peter Cordes Feb 10 '20 at 12:50
  • @Peter I don't know. I can't guess "what you really wanted". – Jesper Juhl Feb 10 '20 at 12:52
  • Yes, I understand what | 16 does in absolute terms, but in this context why is it done and in what way does it help improve the code? – Luigi Castelli Feb 10 '20 at 12:57
  • If you're only wondering about the "why", not the "how" part of getting an odd multiple of 16, I'd suggest editing your question to make that clearer. I took a guess at a specific title in my last edit; please correct if that wasn't what you wanted. Note that it's the *size* being padded to an odd multiple of 16, not the address itself. The comment makes sense to me after noticing that, although I'm not sure what HW details make that useful. Maybe to avoid the same offset in every buffer aliasing the same set, but cache is associative so it's ok if some are aligned the same. – Peter Cordes Feb 10 '20 at 21:12
  • @PeterCordes I am wondering about the how as well. When I say "in what way" I mean how. I will follow your suggestion and edit the answer to make that clearer. BTW, thanks for editing the title. With your edit it is now clearer and more specific than what I had. – Luigi Castelli Feb 11 '20 at 07:27
  • `32 | 16` = 48. `48 | 16` = 48. Same thing applies regardless of other high bits being set in the value. – Peter Cordes Feb 11 '20 at 07:30
  • Question edited. Hopefully it is now clearer for everybody. – Luigi Castelli Feb 11 '20 at 07:43
  • That entire alignment stuff only appears meaningful to me at all if `operator new` itself already returns appropriately aligned buffers. Otherwise, we just could adjust the size to a multiple of the standard alignment (presumably 8) and wouldn't get any worse... – Aconcagua Feb 11 '20 at 11:12
  • Side-note: If we cannot enforce `new` to align data more strictly than the default (or just don't want to go all the way by providing a custom `operator new`), we could simply allocate additional [required alignment - standard alignment] bytes and move the start of our first buffer inside the allocated array to the first appropriately aligned byte. – Aconcagua Feb 11 '20 at 11:17
  • @Aconcagua as noted in some of the above comments, starting from C++17 it is possible to specify alignment requirement when calling operator new. – Luigi Castelli Feb 11 '20 at 18:35

2 Answers


Disclaimer

Based on the comment referring to Altivec, this is specific to the Power architecture, which I'm not familiar with. Also, the code is incomplete, but it looks like the allocated memory is organized in one or multiple adjacent buffers, and the size adjustment only works when there are multiple buffers. We don't know how data is accessed in these buffers. There will be a lot of assumptions in this answer, to the point that it may be totally incorrect. I'm posting it mostly because it's too large for a comment.

Answer (sort of)

I can see one possible advantage of the size modification. First, let's remember some details about Power architecture:

  • Altivec vector size is 16 bytes (128 bits)
  • Cache line size is 128 bytes

Now, let's take an example where AllocateBuffers allocates memory for 4 buffers (i.e. mABL.mNumberBuffers is 4) and nBytes is 256. The adjustment turns each buffer's size into ((256 + 15) & ~15) | 16 = 272 bytes. Let's see how these buffers are laid out in memory:

| Buffer 1: 256+16=272 bytes | Buffer 2: 272 bytes | Buffer 3: 272 bytes | Buffer 4: 272 bytes |
^                            ^                     ^                     ^
|                            |                     |                     |
offset: 0                    272                   544                   816

Notice the offset values and compare them against cache line boundaries. For simplicity, let's assume the memory is allocated at a cache line boundary. It doesn't really matter, as will be shown below.

  • Buffer 1 starts at offset 0, which is the beginning of a cache line.
  • Buffer 2 starts 16 bytes past the cache line boundary (which is at offset 2*128=256).
  • Buffer 3 starts 32 bytes past the cache line boundary (which is at offset 4*128=512).
  • Buffer 4 starts 48 bytes past the cache line boundary (which is at offset 6*128=768).

Note how the offset from the nearest cache line boundary increments by 16 bytes. Now, if we assume that data in each of the buffers will be accessed in 16-byte chunks, in the forward direction, in a loop, then the cache lines are fetched from memory in a rather specific order. Let's consider the middle of the loop (since at the beginning the CPU will have to fetch cache lines for the start of every buffer):

  • Iteration 5
    • Load from Buffer 1 at offset 5*16=80, we are still using the cache line that was fetched on previous iterations.
    • Load from Buffer 2 at offset 352, we are still using the cache line that was fetched on previous iterations. The cache line boundary is at offset 256, we're at its offset 96.
    • Load from Buffer 3 at offset 624, we are still using the cache line that was fetched on previous iterations. The cache line boundary is at offset 512, we're at its offset 112.
    • Load from Buffer 4 at offset 896, we hit a new cache line boundary and fetch a new cache line from memory.
  • Iteration 6
    • Load from Buffer 1 at offset 6*16=96, we are still using the cache line that was fetched on previous iterations.
    • Load from Buffer 2 at offset 368, we are still using the cache line that was fetched on previous iterations. The cache line boundary is at offset 256, we're at its offset 112.
    • Load from Buffer 3 at offset 640, we hit a new cache line boundary and fetch a new cache line from memory.
    • Load from Buffer 4 at offset 912, we are still using the cache line that was fetched on the last iteration. The cache line boundary is at offset 896, we're at its offset 16.
  • Iteration 7
    • Load from Buffer 1 at offset 7*16=112, we are still using the cache line that was fetched on previous iterations.
    • Load from Buffer 2 at offset 384, we hit a new cache line boundary and fetch a new cache line from memory.
    • Load from Buffer 3 at offset 656, we are still using the cache line that was fetched on the last iteration. The cache line boundary is at offset 640, we're at its offset 16.
    • Load from Buffer 4 at offset 928, we are still using the cache line that was fetched on previous iterations. The cache line boundary is at offset 896, we're at its offset 32.
  • Iteration 8
    • Load from Buffer 1 at offset 8*16=128, we hit a new cache line boundary and fetch a new cache line from memory.
    • Load from Buffer 2 at offset 400, we are still using the cache line that was fetched on previous iterations. The cache line boundary is at offset 384, we're at its offset 16.
    • Load from Buffer 3 at offset 672, we are still using the cache line that was fetched on previous iterations. The cache line boundary is at offset 640, we're at its offset 32.
    • Load from Buffer 4 at offset 944, we are still using the cache line that was fetched on previous iterations. The cache line boundary is at offset 896, we're at its offset 48.

Note that the order in which new cache lines are fetched from memory does not depend on the order of accessing buffers within each loop iteration. Also, it does not depend on whether the whole memory allocation was aligned to a cache line boundary. Also note that if the buffer contents were accessed in reverse order then the cache lines would be fetched in reverse order, but still one at a time.

This ordered cache line fetching may aid the hardware prefetcher in the CPU, so that when the next loop iteration is executed, the required cache line is already pre-fetched. Without it, every 8th iteration of the loop would require 4 new cache lines in whatever order the buffers are accessed by the program, which could be interpreted as random access to memory and hamper the prefetcher. Depending on the loop complexity, this 4-cache-line fetch may not be hidden by out-of-order execution and may introduce a stall. This is less likely to happen when you only fetch up to 1 cache line per iteration.
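
To make the fetch pattern concrete, here is a small simulation (my own sketch, not part of the original code; the 128-byte line size, 4 buffers and 272-byte stride are taken from the example above) that counts how many new cache lines each iteration touches:

#include <cstdio>
#include <set>

int main() {
    const unsigned lineSize = 128;            // cache line size on Power, per the answer
    for (unsigned stride : {256u, 272u}) {    // buffer size without / with the | 16 adjustment
        std::set<unsigned> cached;            // cache lines fetched so far
        std::printf("stride %3u:", stride);
        for (unsigned i = 0; i < 16; ++i) {   // 16-byte chunks, forward direction
            unsigned newLines = 0;
            for (unsigned buf = 0; buf < 4; ++buf)  // 4 buffers at the same index
                newLines += cached.insert((buf * stride + i * 16) / lineSize).second;
            std::printf(" %u", newLines);
        }
        std::printf("\n");
    }
}

With the 256-byte stride it prints bursts of 4 new lines every 8th iteration (4 0 0 0 0 0 0 0 4 0 ...), while with 272 the fetches after the initial burst are spread out at no more than one per iteration (4 0 0 0 0 1 1 1 1 0 ...).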

Another possible benefit is avoiding address aliasing. I don't know the cache organization of Power, but if nBytes is a multiple of the page size, using multiple buffers at once, each of them page-aligned, could result in lots of false dependencies and hamper store-to-load forwarding. The code applies the adjustment not only when nBytes is a multiple of the page size, though, so aliasing was probably not the main concern.

  3) Am I right in thinking that the above function will only work correctly based on the assumption that the new operator will return at least 16-byte aligned memory? In C++ the new operator is defined as returning a pointer to storage with alignment suitable for any object with a fundamental alignment requirement, which might not necessarily be 16 bytes.

Yes, C++ does not guarantee any particular alignment, other than that the storage is suitably aligned for any object of a fundamental type. C++17 adds support for dynamic allocation of over-aligned types.

However, even with older C++ versions, every compiler also adheres to the target system's ABI specification, which may specify alignment for memory allocations. In practice, on many systems malloc returns at least 16-byte aligned pointers, and operator new uses memory returned by malloc or a similar lower-level API.

It's not portable though, and therefore not a recommended practice. If you require a particular alignment, either make sure you're compiling for C++17 or use specialized APIs, like posix_memalign.
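
For illustration, a minimal sketch of those options; the 64-byte figure is just an example alignment (e.g. for AVX-512), and posix_memalign is POSIX-only:

#include <cstdlib>  // std::aligned_alloc, std::free (and posix_memalign on POSIX)

int main() {
    // C++17: new on an over-aligned type uses the aligned operator new
    struct alignas(64) Chunk { float data[16]; };
    Chunk* c = new Chunk[32];
    delete[] c;

    // C++17 (from C11): the size must be a multiple of the alignment
    float* p = static_cast<float*>(std::aligned_alloc(64, 1024 * sizeof(float)));
    std::free(p);

    // POSIX: the alignment must be a power of two multiple of sizeof(void*)
    void* q = nullptr;
    if (posix_memalign(&q, 64, 1024 * sizeof(float)) == 0)
        std::free(q);
}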

Andrey Semashev
  • Parts of that comment might have been written at different times. e.g. it might have been just "align successive buffers for Altivec" originally (because that was Apple's *first* ISA with SIMD, before x86 and before ARM with NEON). Regardless, I don't think we can or should rule out there being a benefit on other ISAs, especially in-order ARM with potentially limited memory-level parallelism. (Your idea about staggering cache misses might benefit such CPUs the most.) But G4 PPC with AltiVec may have been in-order, or limited OoO exec window: https://en.wikipedia.org/wiki/PowerPC_G4#e600 – Peter Cordes Feb 11 '20 at 10:43
  • (also https://en.wikipedia.org/wiki/AltiVec#Implementations). I don't think TLB is likely to be relevant; this small skew won't change which *page* is being accessed very much. But it could affect aliasing for conflict misses in L1d and/or L2 cache. e.g. PPC7450 had an on-die 256k 8-way L2. Possibly also avoiding exact multiples of the page size helps for memory disambiguation (figuring out if a load is reloading a recent store or not, often by looking at only the low bits of an address. e.g. x86 CPUs have 4k-aliasing false dependencies; skewing buffers helps with that.) – Peter Cordes Feb 11 '20 at 10:49
  • > I don't think TLB is likely to be relevant; this small skew won't change which page is being accessed very much. -- Yes, you're probably right. I corrected the answer. – Andrey Semashev Feb 11 '20 at 12:16
  • @AndreySemashev great answer. Thank you. So, without using this "trick" if we were to instantiate N buffers, with N being a large number (in my code I could potentially be instantiating up to N=1024x1024=1048576 buffers) we could require the CPU to fetch N cache lines in one iteration. By spacing buffers by odd multiples of 16 we would minimize the number of cache line hits per iteration. This will result in "spreading out" more uniformly cache hits/fetches during iterations. It seems to me like a much smarter/beneficial approach to accessing cache. Am I correct thinking along these lines? – Luigi Castelli Feb 11 '20 at 19:16
  • Well, it would work best for up to 128/16=8 buffers (and on x86 - 64/16=4 buffers). With 8 buffers you'd be fetching 1 cache line per each iteration (instead of 8 cache lines on every 8th iteration). With more buffers that number will increase. With 1048576 buffers you are fetching 131072 cache lines per iteration. At this point I'm not sure how efficient this technique is, as you may run out of cache. Probably still better than without it, although you will likely be bottlenecked by the system memory anyway. – Andrey Semashev Feb 11 '20 at 19:38
  • With this many buffers you may want to consider rearranging the data, if you actually have to process all buffers at once. Interleaving multiple buffer elements into one buffer will make memory accesses more linear and predictable by hardware prefetchers. – Andrey Semashev Feb 11 '20 at 19:43

Re: the "how" part: ORing in one set bit (0x10, aka 16) makes the size an odd multiple of 16. Even multiples of 16 have that bit cleared, i.e. they're also multiples of 32; unconditionally setting it makes sure that can't be the case.

For example: 32 | 16 = 48. 48 | 16 = 48. Same thing applies regardless of other high bits being set in the value after aligning by 16.

Note that it's the allocation size being adjusted here. So if multiple buffers are carved contiguously out of a big allocation, they won't all start at the same alignment relative to a cache-line boundary. As Andrey's answer points out, they might be staggered if they end up having sizes of n * line_size + 16.
It wouldn't help at all if they all get allocated with the start of the buffer aligned at the start of a page by an allocator that falls back to using mmap directly for large allocations (e.g. glibc's malloc). Presumably (at least when this was written), Apple didn't do that.
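
A quick way to see the staggering (my own snippet; 64-byte lines as on x86, four buffers carved from one block):

#include <cstdio>

int main() {
    for (unsigned stride : {256u, 272u}) {  // even vs. odd multiple of 16
        std::printf("stride %u: buffer starts mod 64 =", stride);
        for (unsigned buf = 0; buf < 4; ++buf)
            std::printf(" %2u", (buf * stride) % 64);
        std::printf("\n");
    }
}

With 256-byte buffers every buffer starts at the same line offset (0 0 0 0); with 272-byte buffers the starts are staggered (0 16 32 48).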

Requests for buffer sizes of a large power of 2 are probably not rare.


Note that this comment is probably old: Altivec was Apple's first ISA with SIMD, before they adopted x86, and before they made iPhone with ARM + NEON.

Skewing your buffers (so they're not all aligned the same relative to a page, or maybe a cache line) is still useful on x86, and probably also on ARM.

The use-cases for these buffers must include loops that access two or more of them at the same indices. e.g. A[i] = f(B[i]).

The performance reasons for this might include:

  • avoid cache-bank conflicts on x86 Sandybridge-family (https://www.agner.org/optimize/blog/read.php?i=142 ; and Agner Fog's microarch pdf)
  • avoid conflict misses when accessing more arrays than the L1 or L2 cache associativity in one loop. If one array has to get evicted to make room to cache the other one, it might happen once per whole line instead of once per SIMD vector within a line.
  • avoid memory disambiguation false dependencies for stores (4k aliasing). e.g. L1 memory bandwidth: 50% drop in efficiency using addresses which differ by 4096+64 bytes. x86 Intel CPUs only look at the low 12 bits of store / load addresses as a quick first check for whether a load overlaps an in-flight store. A store with the same offset within a 4k page as a load effectively aliases it until the hardware figures out that it actually doesn't, but that delays the load. I wouldn't be surprised if memory disambiguation on PPC had a similar fast path.
  • Andrey's guess about staggering cache misses: I like that idea, and it would be more important on early PowerPC CPUs with limited out-of-order execution windows (and presumably limited memory-level parallelism) compared to modern high-end x86 and Apple's high-end ARM. https://en.wikipedia.org/wiki/AltiVec#Implementations. It might also help on modern in-order ARM CPUs (which might also have limited memory-level parallelism). Some Apple devices have used in-order ARM, I'm sure, at least as low-power cores for big.LITTLE setups.

(When I say "avoid", sometimes this is just "reduce the likelihood of".)

Peter Cordes
  • Thanks for your answer, Peter. It just so happens that in my code I am allocating a large amount of buffers of the same size. The size being a power of 2... and I am on x86. So it would probably be beneficial to stagger the buffers this way. – Luigi Castelli Feb 11 '20 at 19:10