
Caching is a core concern when it comes to efficiency.

I know that caching usually happens automatically.

However, I'd like to control cache usage myself, because I think I can do better than heuristics that don't know the exact program.

Therefore I would need assembly instructions to move data directly to or from cache memory cells.

like:

```
movL1 address content
```

I know that there are some instructions that give the "caching system" hints, but I'm not sure that's enough: the hints could be ignored, and they may not be sufficient to express everything such a move-to/from-cache instruction could.

Are there any assemblers that allow for complete cache control?

Side note: why I'd like to improve caching:

Consider a hypothetical CPU with one register and a cache containing two cells.

Consider the following two programs:

(where x, y, z, a are memory cells)

"START"
"move 1 to x"
"move 2 to y"
"move 3 to z"
"move 4 to a"
"move z to x"
"move y to x"
"END"

"START"
"move 1 to x"
"move 2 to y"
"move 3 to z"
"move 4 to a"
"move a to x"
"move y to x"
"END"

In the first case, you'd use the register and the cache for x, y, z (a is only written once). In the second case, you'd use the register and the cache for a, x, y (z is only written once).

If the CPU does the caching, it simply can't decide ahead of time which of the two cases above it's facing.

It has to decide, for each of the memory cells x, y, z, whether its contents should be cached before it knows whether the program being executed is no. 1 or no. 2, because both programs start out the same.

The programmer on the other hand knows ahead of time which memory cells are reused, and when they are reused.

KGM
  • It's not a question of assembler. It's a question of architecture, which you forgot to specify. The short answer is no. – Jester Jun 05 '20 at 17:31
  • This isn't really a matter of assemblers supporting it, but of low-level processor control. I know that on some (many?) processors, the initial BIOS/other code will use cache as RAM before DDR has been initialized. You'll probably have a hard time figuring out how to do that, though; the low-level processor details are usually hidden behind NDAs. This isn't relevant to your question, but it is interesting. – Thomas Jager Jun 05 '20 at 17:31
  • On most ISAs, no. The only way to use cache is as a transparent cache that you load/store through. Xeon Phi can configure its HBM as either a cache or a separate "local memory". x86 CPUs can run in cache-as-RAM no-fill mode, used by the BIOS in early startup before configuring DRAM controllers. But that's really just no fills on read or write, and read-as-zero for invalid lines, so you can't use DRAM at all in that mode. – Peter Cordes Jun 05 '20 at 17:33
  • Sorry, I'm a noob at assembly; could you please explain this more simply? What's a CPU "mode"? What's that HBM? How do you set a CPU mode? What are NDAs? – KGM Jun 05 '20 at 17:36
  • @ThomasJager: no-fill mode is not super secret; Coreboot (open-source x86 firmware) uses it, and it might even be a documented control-register or MSR setting. It's what I thought of, too. [What use is the INVD instruction?](https://stackoverflow.com/q/41775371) / [Cache-as-Ram (no fill mode) Executable Code](https://stackoverflow.com/q/27699197) have some details. – Peter Cordes Jun 05 '20 at 17:37
  • @KGM: HBM = https://en.wikipedia.org/wiki/High_Bandwidth_Memory. But I was misremembering: [Xeon Phi](https://en.wikipedia.org/wiki/Xeon_Phi)'s fast memory that can be configured as architecturally visible or as a transparent last-level cache is "[MCDRAM](https://en.wikipedia.org/wiki/MCDRAM)", a competitor to HBM. At 16GiB, it's vastly bigger than on-die L1/L2 caches. – Peter Cordes Jun 05 '20 at 17:39
  • So that's some sort of "future technology" replacing the old L1, L2, L3 caches? Or is it just a supplement? Will it replace the registers too? (I don't know; maybe future CPUs will have thousands of registers.) How fast are these mega caches compared to registers? – KGM Jun 05 '20 at 17:43
  • As you can see from the wiki article I linked, it's in addition to Xeon Phi's L1 and L2 caches (MCDRAM is its L3). It's the same idea as CPUs with EDRAM as an L4 cache, like Intel chips with Iris Pro graphics, or those Broadwell desktop CPUs that had EDRAM. So it's in addition, not instead, as part of the cache hierarchy. – Peter Cordes Jun 05 '20 at 17:58

3 Answers


Peter Cordes wrote:

On most microarchitectures for most ISAs, no, you can't pin a line in cache to stop it from being evicted. The only way to use cache is as a transparent cache that you load/store through.

This is correct, but the exceptions are of interest:

It is common in DSP ("Digital Signal Processing") chips to provide a limited ability to partition SRAM between "cache" and "scratchpad memory" functionality. There are lots of white papers and reference guides on this topic -- an example is http://www.ti.com/lit/ug/sprug82a/sprug82a.pdf. In this chip, there are three blocks of SRAM -- a small "Level-1 Instruction" SRAM, a small "Level-1 Data" SRAM, and a larger "Level-2" SRAM. Each of the three can be partitioned between Cache and directly-addressed memory, with the details depending on the specific chip. For example, a chip may allow no cache, 1/4 SRAM as cache, 1/2 SRAM as cache, or all SRAM as cache. (The ratios are limited so the allowed cache sizes can be indexed efficiently.)
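As a sketch of what this looks like to software: the partition is typically selected through a memory-mapped cache-configuration register. The register name, address, and field encoding below are invented for illustration (the real ones are in the device's reference guide, e.g. the TI document linked above), so treat this as a shape, not a recipe.

```c
#include <stdint.h>

/* Hypothetical L2 configuration register -- address and encoding assumed. */
#define L2CFG (*(volatile uint32_t *)0x01840000u)

/* Assumed mode field: how much of the L2 SRAM behaves as cache;
 * the remainder is directly addressable scratchpad. */
enum l2_mode { L2_ALL_SCRATCHPAD = 0, L2_CACHE_32K = 1,
               L2_CACHE_64K = 2, L2_CACHE_128K = 3, L2_CACHE_256K = 4 };

static void l2_set_cache_size(enum l2_mode mode)
{
    L2CFG = (L2CFG & ~0x7u) | (uint32_t)mode; /* program the mode field */
    (void)L2CFG; /* read back so the write takes effect before continuing */
}
```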

The IBM "Cell" processor (used in the Sony PlayStation 3, released in 2006) was a multi-core chip with one ordinary general-purpose core and eight co-processor cores. The co-processor cores had a limited instruction set, with load and store instructions that could only access their private 128KiB "scratchpad" memory. In order to access main memory, the co-processors had to program a DMA engine to perform a block copy of main memory to local scratchpad memory (or vice versa). This approach provided (and required) perfect control over data motion, resulting in (a very small amount of) very high-performance software.

Some GPUs also have small on-chip SRAMs that can be configured as either an L1 cache or as explicitly controlled local memory.

All of these are considered to be "very hard" (or worse) to use, but this can be the right approach if the product requires very low cost, completely predictable performance, or very low power.

John D McCalpin
  • Well, interesting... So I see that you CAN actually do that! At least with some CPUs and GPUs, and it greatly boosts performance! However, I am somewhat worried about the market share of those CPUs... Are they common? How about AMD and Intel? Can you do such things with high-market-share AMD and/or Intel CPUs? If yes, that would be very interesting! (I am considering market share in servers, and/or mobiles, and/or desktops, and/or laptops. A CPU capable of that with a high market share in any of these categories would be interesting, especially one with a high market share in the server category.) – KGM Jun 06 '20 at 12:58

On most microarchitectures for most ISAs, no, you can't pin a line in cache to stop it from being evicted. The only way to use cache is as a transparent cache that you load/store through.

Of course, a normal load will definitely bring a cache line into L1d cache, at least temporarily; nothing stops it from being evicted later, though. For example, on x86-64: mov eax, [rdi] instead of prefetcht0 [rdi].

Before dedicated prefetch instructions existed, using a plain load as a prefetch was sometimes done (e.g. ahead of some loop-bounds calculations before entering a loop that would start looping over an array). For performance purposes, best-effort software prefetch instructions that the CPU can ignore are usually better.

A plain load has the downside of not being able to retire from the out-of-order back-end until the loaded data actually arrives. (At least I think it can't on x86 CPUs with x86's strongly ordered memory model. Weakly-ordered ISAs that allow out-of-order loads might let the load retire even if it hasn't truly completed yet.) Software prefetch instructions exist to allow prefetch as a hint without bottlenecking the CPU on waiting for the load to finish.
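In C, the usual pattern is a best-effort prefetch some fixed distance ahead of a streaming loop, via the _mm_prefetch intrinsic (which compiles to prefetcht0 and friends). A minimal sketch; the 128-element (512-byte) distance is a guess that would need tuning per CPU:

```c
#include <immintrin.h>
#include <stddef.h>

float sum(const float *a, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; i++) {
        if (i + 128 < n) /* hint: we'll want this line ~8 cache lines from now */
            _mm_prefetch((const char *)&a[i + 128], _MM_HINT_T0);
        s += a[i];       /* the actual load; correct with or without the hint */
    }
    return s;
}
```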

On modern x86, forced eviction of a cache line is possible. NT stores guarantee that on Pentium-M or newer (or maybe only on CPUs after Pentium-M; I forget which). Also, clflush and clflushopt exist specifically for that.

clflush is not just a hint that the CPU can drop; it guarantees eviction, which is needed for correctness with non-volatile DIMMs like Optane DC PM. (See: Why does CLFLUSH exist in x86?)
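Both mechanisms are exposed as compiler intrinsics, so you don't need raw assembly. A minimal sketch of each:

```c
#include <emmintrin.h> /* SSE2: _mm_stream_si32, _mm_clflush, _mm_sfence */

/* NT store: a write that goes around the cache (evicting the line if present). */
void nt_store(int *p, int value)
{
    _mm_stream_si32(p, value);
    _mm_sfence(); /* order the NT store before subsequent stores */
}

/* Plain store followed by a guaranteed flush of the containing line. */
void store_and_flush(int *p, int value)
{
    *p = value;
    _mm_clflush(p); /* guaranteed eviction -- which is exactly why it's slow */
}
```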

Being guaranteed, not just a hint, makes it slow; you generally don't want to do this for performance. As @old_timer says, burning instructions/cycles micro-managing the cache is almost always a waste of time. Leaving things up to the hardware's pseudo-LRU replacement and HW prefetch algorithms usually provides good results in the long run. SW prefetch can help in a few cases.


Xeon Phi can configure its MCDRAM as a large last-level cache, or as architecturally visible "local memory" that's part of physical address space. But at 6 to 16GiB, it's vastly bigger than the on-die L1/L2 caches, or the L1/L2/L3 caches of modern mainstream CPUs.
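When the MCDRAM is configured as visible "local memory" (flat mode), software allocates from it explicitly. One real interface for that is the memkind library's hbwmalloc API; a minimal sketch, assuming memkind is installed and the MCDRAM is exposed as a high-bandwidth NUMA node:

```c
#include <hbwmalloc.h> /* memkind's high-bandwidth-memory allocator */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    if (hbw_check_available() != 0) { /* returns 0 if HBM is present */
        fprintf(stderr, "no high-bandwidth memory exposed\n");
        return EXIT_FAILURE;
    }
    double *buf = hbw_malloc(1u << 20); /* 1 MiB backed by MCDRAM, not DRAM */
    if (!buf)
        return EXIT_FAILURE;
    /* ... bandwidth-bound work on buf ... */
    hbw_free(buf);
    return EXIT_SUCCESS;
}
```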

Also, x86 CPUs can run in cache-as-RAM no-fill mode, used by the BIOS in early startup before configuring DRAM controllers. But that's really just no fills on read or write, and read-as-zero for invalid lines, so you can't use DRAM at all while no-fill mode is active. I.e. only the cache is available, and you have to be careful not to evict anything that was cached. It's not usable for any practical purpose except early boot.

[What use is the INVD instruction?](https://stackoverflow.com/q/41775371) and [Cache-as-Ram (no fill mode) Executable Code](https://stackoverflow.com/q/27699197) have some details.

I know that there are some instructions that give the "caching system" hints, but I'm not sure that's enough: the hints could be ignored, and they may not be sufficient to express everything such a move-to/from-cache instruction could.

Peter Cordes
  • Does MCDRAM need to be initialized by the firmware like DRAM, or is it immediately usable at reset? – Melab Jan 06 '23 at 00:49
  • @Melab: I have no idea what the power-on default state is. I wouldn't be surprised if it's like modern mainstream x86 CPUs, where the first thing firmware does is put the CPU into "cache as RAM" mode while it initializes the DRAM controllers. Or maybe it doesn't have to, since the MCDRAM is soldered on Xeon Phi, not socketed like on mainstream mobos (where memory might not even be present). Are you planning to write your own open-source BIOS/firmware for Xeon Phi systems / cards? – Peter Cordes Jan 06 '23 at 00:55

Direct access to the cache SRAMs has nothing to do with the instruction set. If you have access, then you have access, and you access it however the chip/system designers implemented it. It could be as simple as an address space, or it may be some indirect, peripheral-like access where you poke at control registers and that logic accesses the item in the cache for you.

And this doesn't mean that all ARM processors can gain access to their cache in the same way (Arm is an IP company, not a chip company), but it might mean that, no, you can't do this on any existing x86. I know for a fact that on the product I am part of, we can do this, because we have ECC on those SRAMs and have an access method to initialize the RAMs from software before enabling the monitor. Some of the SRAMs you can reach through normal accesses, but, for example, the Arm core we are using was implemented with parity checking rather than ECC, so we added ECC on the SRAM and a side-door access for init, because trying to go through the cache with normal accesses and get 100% coverage was a PITA, and in the end not the right solution.

I also worked on a product where the DRAM controller's cache could be directly accessed as an on-chip RAM; it was up to software to decide whether to use it as an L2 cache or as on-chip RAM.

So it has been and can be done, but these are isolated examples. As part of screening the parts, there are MBIST (memory built-in self-test) tests that run, but often those are driven through JTAG and not directly available to the processor, and/or the RAM isn't directly available; sometimes the MBIST can be started and checked by software but the RAM can't be accessed; and in some implementations, the designers made it so software can touch all of it, including the tag RAM.

Which leads to this: if you think you can do a better job than the hardware and want to move stuff around, then you will likely also need access to the tag RAM, so that you can trace/drive where you want each cache line to go, its status, etc.

Based on this comment:

Sorry, I'm a [beginner] at assembly; could you please explain this more simply? What's a CPU "mode"? What's that HBM? How do you set a CPU mode? What are NDAs? – KGM

Two things: one, you can't do better than the cache, and two, you are not ready for this task.

Even with experience, you generally can't do better than the cache. If you want to work with the cache, you use the same knowledge -- how you write your code, where you place it in memory, and where the data you are using lives -- and then the logic's implementation can work better for you. Burning instructions and cycles trying to reposition things at runtime isn't going to help. You generally need access to the design at a level that is not available to the general public; thus an NDA (non-disclosure agreement). And even then, it is extremely unlikely that you will get the info you need, and/or the gains will be minimal and may only apply to one implementation rather than the whole family of products, etc.

More interesting: what do you think you can do better, and how are you thinking you can do it? (Also understand that many of us here can make any cache implementation fail and run slower than if it weren't there. Even if you create a newer, better cache, by definition it only improves performance in certain cases.)

old_timer
  • For cost/performance reasons, the SRAMs in the cache are wired so as not to be directly accessible; there would have to be extra busses, control signals, etc. in order to do this, and it is generally not worth it, unless, as mentioned, it's for a specific initialization issue or for chip screening of some sort (or failure analysis for specialized use cases, although the extra logic creates extra risk, which creates more failures). The general answer is no, there is no access at that level. But there are isolated exceptions. – old_timer Jun 05 '20 at 18:00