
I've read a 2006 article about how CPUs operate on whole L1 cache lines even in situations where you only need to do something with a small fraction of what the line contains (e.g. loading a whole L1 line just to write to a Boolean variable is obviously overkill). The article encouraged optimization by managing memory in an L1-cache-friendly way.

Let's say I have two int variables that just happen to be consecutive in memory, and in my code I write to both of them consecutively.
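
For concreteness, something like this (the names are made up, and I'm just assuming the compiler happens to place the two variables next to each other):

```cpp
int a;  // assume a and b happen to end up adjacent in memory
int b;

void update() {
    a = 1;  // first write
    b = 2;  // second write, to the neighbouring int
}
```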

Does the hardware consolidate my two code operations into one physical operation on a single L1 line (assuming the CPU's L1 cache line is big enough to hold both variables), or not?

Is there some way to suggest such a thing to the CPU in C++ or C?

If the hardware doesn't do any such consolidation, do you think it could yield better performance if this were implemented in code? For example, allocating a memory block the size of an L1 line and packing it with as many hot-data variables as possible?

BMC
  • Can you post a link or the name of the article you read? Also, when you say CPU, is it a RISC or CISC or regardless of the architecture? – SSC Oct 09 '14 at 07:05
  • http://www.farbrausch.de/~fg/seminars/lightspeed.html This is the page where you can see the various materials on the presentation. And when I say CPU I am talking about a generic x86. – BMC Oct 09 '14 at 07:08
  • You might be interested in [*Ulrich Drepper's "What every programmer should know about memory"*](http://lwn.net/Articles/255364/), which illustrates write-combining builtins available on Linux and describes a lot of related CPU architecture and issues. – Tony Delroy Oct 09 '14 at 07:58
  • I think you might have misinterpreted what the article says. The CPU won't *load* a full cache line into any register; it reads from and writes back to the next cache/memory level in whole cache lines, but not between L1 and the CPU. – David Rodríguez - dribeas Oct 09 '14 at 08:57

4 Answers


The size of the cache line is primarily relevant for concurrency. It is the smallest block of data that can be synchronized between multiple processors.

It is also, as you suggest, necessary to load the whole cache line to do an operation on just a few bytes of it. If you do multiple operations on the same processor, though, it won't need to continually reload it; the data is actually cached, as the name suggests. This includes caching writes to the data. So long as only one processor is accessing the data, you can usually rest assured that it is doing so efficiently.

In cases where multiple processors access the data, it may be helpful to align the data. Using the C++ alignas specifier, or compiler extensions, can help you get data structures that are aligned the way you want.
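
For example, here is a minimal sketch of padding per-thread counters to a cache line boundary so they don't share a line (the 64-byte line size and the names are assumptions for illustration; check your target's actual line size):

```cpp
#include <atomic>

// Assuming a 64-byte cache line (typical on x86). alignas(64) makes each
// counter start on its own cache line, so two threads incrementing
// different counters don't keep bouncing the same line between cores.
struct alignas(64) PaddedCounter {
    std::atomic<long> value{0};
    // the compiler pads the struct out to 64 bytes
};

PaddedCounter counters[4];  // e.g. one per thread, each on its own line
```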

You may be interested in my article CPU Reordering – What is actually being reordered?, which gives a few insights into what happens (at least logically) at the low level.

edA-qa mort-ora-y

It is quite a broad question, but I will try to cover the main points.

Yes, reading a whole cache line into the cache only to look at a single bool is a bit wasteful - however, the processor typically doesn't KNOW what you are planning to do after that, for example whether you will need the next consecutive value or not. You can rely on data in the same class or struct being located next to/near each other, so storing the data that you often operate on together close together will give you that benefit.
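
For example (a rough sketch; the field names are invented for illustration):

```cpp
// Fields that are typically touched together are kept adjacent, so one
// cache line fill brings them all in.
struct Particle {
    float x, y, z;      // position, updated every frame
    float vx, vy, vz;   // velocity, updated every frame
    // rarely-touched "cold" data (names, debug info, ...) is better kept
    // in a separate structure so it doesn't dilute the hot cache lines
};
```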

As for operating on "more than one piece of data at once", most modern processors have various forms of extensions to do the same operation on multiple data items (SIMD - single instruction, multiple data). This started with MMX in the late 1990s, and has been extended to include 3DNow!, SSE and AVX for x86. ARM has the "Neon" extension, which provides similar functionality. PowerPC also has something similar whose name escapes me at the moment.

There is no way for a C or C++ program to directly control the choice of instruction or the cache usage. But modern compilers, given the right options, will produce code that, for example, uses SIMD instructions to sum all the ints in a larger array by adding 4 items at a time, and then, when the whole lot has been done, horizontally adding the 4 values. Or if you have a set of X, Y, Z coordinates, it may well use SIMD to add two sets of such data together. It is the compiler's choice to do this, but it is something that can save quite a bit of time, so the optimisers in compilers are being modified to find cases where this helps and to use these types of instructions.
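
For example, a plain reduction loop like the sketch below is a prime candidate for such auto-vectorisation (assuming optimisation is enabled, e.g. -O3 on GCC or Clang; exact behaviour depends on the compiler and target):

```cpp
#include <cstddef>

// With vectorisation enabled, the compiler may load 4 (or 8) ints per SIMD
// instruction, keep partial sums in a vector register, and do a single
// horizontal add after the loop.
int sum_ints(const int* data, std::size_t n) {
    int total = 0;
    for (std::size_t i = 0; i < n; ++i)
        total += data[i];
    return total;
}
```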

And finally, most larger modern processors (x86 since ca. 1995, ARM Cortex-A15, PowerPC) also do superscalar execution - executing more than one instruction at a time - combined with "out of order execution" (the processor understands the dependencies between instructions and executes those that are "ready" to execute, not exactly in the order they were given to the processor). The compiler will know this and try to "help" arrange the code so that the processor gets an easy task.

Mats Petersson
  • I assume that SIMD only applies to cases where you do exactly the same operation on more than one variable? What if I am writing different data to different variables, but the memory addresses are arranged such that the L1 cache fetches both variables in one line? Will the CPU recognise the opportunity to write to both, or will it load the same line again a second time? – BMC Oct 09 '14 at 08:13
  • The purpose of the cache is to "hold small amounts of data that you may access soon after the first access to the cacheline", so the processor will hold onto the contents of the cache until it needs that same cache-location for something else. Most caches have a number of "ways", meaning different locations that correspond to a particular address - so it won't IMMEDIATELY throw out the content of a particular cache-line if some other data comes from a "matching" address [caches are typically organised based on the address of the data]. – Mats Petersson Oct 09 '14 at 20:29
  • The processor may well perform another operation on the next piece of data in the cache immediately as the data becomes available (potentially the same clock-cycle, or the next clock-cycle, as the data arrived from the first instruction requesting that cache-line), assuming it is a super-scalar processor. In a traditional single-instruction at a time processor, it would have to wait until the current instruction has completed before it does the next one. – Mats Petersson Oct 09 '14 at 20:31

The whole point of caching is to allow a lot of highly localized memory operations to happen quickly.

The fastest operations involve registers, of course. The only delay involved in using them is in the instruction fetching, decoding, and execution. In some register-rich architectures (and in vector processors), they're actually used like a specialized cache. And all but the slowest processors have one or more levels of cache that look like memory to ordinary instructions, except faster.

To simplify relative to actual processors, consider a hypothetical processor that runs at 2 GHz (0.5 ns per clock), with memory that takes 5 ns to load an arbitrary 64-bit (8-byte) word, but only 1 ns to load each successive 64-bit word. (Assume also that writes are similar.) On such a machine, flipping a bit in memory is pretty slow: 1 ns to load the instruction (only if it's not already in the pipeline – but 5 ns after a distant branch), 5 ns to load the word containing the bit, 0.5 ns to execute the instruction, and 5 ns to write the changed word back to memory.

A memory copy is better: approximately zero to load instructions (since the pipeline presumably does the right thing with loops of instructions), 5 ns to load the first 8 bytes, 0.5 ns to execute an instruction, 5 ns to store the first 8 bytes, and 1+0.5+1 ns for each additional 8 bytes. Locality makes things easier.

But some operations can be pathological: incrementing each byte of an array does the initial 5 ns load, the 0.5 ns instruction, the initial 5 ns store, then 1+0.5+1 ns per byte (rather than per word) thereafter. (A memory copy that doesn't fall on the same word boundaries is also bad news.)
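
For a concrete picture of that pathological case, here is the byte-increment loop the numbers describe (ordinary C++; the timings above are of course properties of the hypothetical machine, not of this code):

```cpp
#include <cstddef>

// On the hypothetical uncached machine above, each iteration after the first
// pays roughly 1 ns (load) + 0.5 ns (execute) + 1 ns (store) per *byte*,
// because memory only moves whole 8-byte words.
void increment_bytes(unsigned char* buf, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        buf[i] += 1;
}
```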

To make this processor faster, we can add a cache that improves loads and stores to just 0.5 ns over the instruction execution time, for data that's in cache. The memory copy doesn't improve on reading, since it still costs 5 ns for the first 8-byte word and 1 ns for additional words, but the writes get much faster: 0.5 ns for every word until the cache fills, and at the normal 5+1+1, etc. rate after it fills, in parallel with other work that uses memory less. The byte increments improve to 5 ns for the initial load, 0.5+0.5 ns for the instruction and write, then 0.5+0.5+0.5 ns for each additional byte, except during cache stalls on reads or writes. More repetition of the same few addresses increases the proportion of cache hits.

What happens with real processors, multiple levels of cache, etc.? The simple answer is that things get more complicated. Writing cache-aware code includes trying to improve locality of memory access, analysis to avoid thrashing the cache, and a whole lot of profiling.

Steve

Yes, back-to-back writes to adjacent int32_t objects in the same cache line can be coalesced in the store buffer of some CPUs, so they commit to L1d as a single 8-byte aligned update. (On many non-x86 CPUs, a full 32-bit store avoids an RMW cycle when updating L1d cache, so coalescing byte stores is good: Are there any modern CPUs where a cached byte store is actually slower than a word store?. On Alpha 21264, even coalescing 32-bit stores into 64-bit commits was important.)

But coalescing in the store buffer only happens after multiple store instructions executed separately. There aren't CPUs that fuse consecutive loads or stores into a single hardware operation for an execution unit.


Some compilers (e.g. GCC8 and later, IIRC) can coalesce loads/stores to adjacent struct members or local variables into a single asm instruction, e.g. storing 4 chars at once with one 32-bit store (or 2 ints on a 64-bit machine). On some ISAs like x86, they will do this even if alignment isn't known.
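
As a minimal sketch of the kind of code this applies to (whether the merge actually happens depends on compiler version, options, and target):

```cpp
struct Pair { int a, b; };

// GCC/Clang at -O2 can typically merge these two adjacent 32-bit stores
// into one 64-bit store, since the members are contiguous in memory.
void set_both(Pair* p) {
    p->a = 1;
    p->b = 2;
}
```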

This does create a single asm operation that accesses multiple C objects. On ISAs with efficient unaligned loads/stores (like x86), it's very often a win. (Cache line splits are not common, and not too expensive. Splits across a 4k boundary are much more expensive on Intel before Skylake, though, like ~100 cycles.)

Using `alignas(8) int foo;` on a struct member to give the whole struct more alignment might enable this compile-time optimization on ISAs without efficient unaligned loads/stores.

I think ARM ldp/stp (load/store pair) is not bad in the not-fully-aligned case, but in the aligned case it can load or store a pair of registers as a single 64 or 128-bit operation.

Peter Cordes