
Is there any way to read/write memory without touching the L1/L2/L3 caches on x86 CPUs?

And are the caches in x86 CPUs managed entirely by hardware?

EDIT: I want to do this so that I can sample the speed of memory and see whether any part of memory's performance degrades.

Michael Tong
  • If you Google _Reading and Writing to memory on an x86 based memory in ANSI C_, what do you see? Just curious. ( I liked this one ***[HERE](http://stackoverflow.com/questions/2554229/memory-alignment-within-gcc-structs)*** ) – ryyker Feb 23 '15 at 22:34
  • @ryyker: The first link I get (rather appropriately) is the wiki page on [segmentation faults](http://en.wikipedia.org/wiki/Segmentation_fault). – wolfPack88 Feb 23 '15 at 22:37
  • 1
    Yes, it's segmentation fault... but I don't think it is "Reading and Writing to memory on an x86 based memory in ANSI C" that get to segmentation fault. What I want is to kind of disable cache, and write or read memory, within the correct boundary of a program – Michael Tong Feb 23 '15 at 22:38
  • Why do you need this? For reading, you can do a long loop reading several megabytes of contiguous memory. That way the cache will be filled with those recently loaded addresses. Then read the variable you want and it will be loaded directly from RAM, because the cache only contains the addresses recently loaded by the loop. – i486 Feb 23 '15 at 22:42
  • 1
    @i486, I want to sample the speed of memory in kernel and see if there is any part of memory's performance degrading – Michael Tong Feb 23 '15 at 22:47
  • Run memtest86+ or memtest86 – i486 Feb 23 '15 at 22:48
  • Non-temporal loads/stores? I'm not sure whether they are suitable for timing purposes though. – gsg Feb 24 '15 at 04:47
  • 2
    Related question: http://stackoverflow.com/q/37070/1084 – Nathan Fellman Mar 01 '15 at 10:48

2 Answers


The CPU indeed manages its own caches in hardware, but x86 provides a few ways for software to affect that management.

To access memory without caching, you could:

  1. Use the x86 non-temporal instructions. They tell the CPU that you won't be reusing the data, so there's no point in retaining it in the cache. In x86 these instructions are usually called movnt* (with a suffix according to the data type, e.g. movnti for storing an integer from a general-purpose register). There are also streaming load/store instructions that use a similar technique but are better suited to high-bandwidth streams (where you load full lines consecutively). To use these, either code them in inline assembly or use the intrinsics provided by your compiler; most compilers expose that family as _mm_stream_* (a combined sketch of options 1 and 3 follows this list).

  2. Change the memory type of the specific region to uncacheable. Since you stated you don't want to disable all caching (and rightfully so, since that would also affect code, stack, page tables, etc.), you could mark the specific region your benchmark's data set resides in as uncacheable, using MTRRs (memory type range registers). There are several ways of doing that; you'll need to read some documentation.

  3. The last option is to fetch the line normally, which means it does get cached initially, but then force it out of all cache levels using the dedicated clflush instruction (or the full wbinvd if you want to flush the entire cache). Make sure to properly fence these operations so that you can guarantee they have completed (and of course don't measure them as part of the latency).
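For illustration, here is a minimal sketch of options 1 and 3 combined (option 2 is omitted because programming MTRRs requires privileged, platform-specific code). It assumes GCC or Clang on x86-64 with SSE2, and the fencing and rdtsc timing are deliberately rough, so treat the printed number as indicative only:

```c
/* Sketch only: assumes GCC/Clang on x86-64, compiled with e.g. -O2. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <emmintrin.h>   /* _mm_stream_si32, _mm_clflush, fences */
#include <x86intrin.h>   /* __rdtsc (GCC/Clang) */

int main(void)
{
    enum { N = 1024 };
    int *buf = malloc(N * sizeof *buf);
    if (!buf) return 1;

    /* Option 1: non-temporal stores (movnti) - the data is written to
     * memory without being kept in the cache hierarchy for reuse.     */
    for (int i = 0; i < N; i++)
        _mm_stream_si32(&buf[i], i);
    _mm_sfence();                    /* make the streaming stores globally visible */

    /* Option 3: let the line be cached, evict it with clflush, then
     * time a reload that has to come all the way from DRAM.           */
    volatile int *vbuf = buf;        /* volatile so the reload is not elided */
    int warm = vbuf[0];              /* bring the line into the caches  */
    _mm_clflush(buf);                /* throw it out of every level     */
    _mm_mfence();                    /* wait for the eviction to finish */

    _mm_lfence();                    /* rough fencing around the timed load */
    uint64_t t0 = __rdtsc();
    int cold = vbuf[0];              /* this load misses all caches     */
    _mm_lfence();                    /* don't read the TSC before the load completes */
    uint64_t t1 = __rdtsc();

    printf("reload after clflush: %llu TSC ticks (%d %d)\n",
           (unsigned long long)(t1 - t0), warm, cold);
    free(buf);
    return 0;
}
```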

Having said that, if you want to do all this just to time your memory reads, you may get poor results, since most CPUs handle non-temporal or uncacheable accesses "inefficiently". If you're just after forcing reads to come from memory, this is best achieved by manipulating the caches' LRU state: sequentially access a data set that's large enough not to fit in any cache. Most LRU schemes (not all!) will then drop the oldest lines first, so the next time you wrap around, they'll have to come from memory.

Note that for this to work, you need to make sure the HW prefetcher does not help (and accidentally cover the latency you want to measure): either disable it, or make the accesses stride far enough apart for it to be ineffective.
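Here is a minimal sketch of that cache-thrashing approach, under the assumption of a 1 GiB buffer and a page-sized stride (hardware prefetchers generally do not prefetch across 4 KiB page boundaries). The sizes are placeholders to adjust for your machine, and the timing loop is left to whatever timer you already use:

```c
/* Sketch only: sizes are assumptions, adjust to your machine. */
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define WSET   ((size_t)1 << 30)  /* 1 GiB buffer                               */
#define STRIDE 4096               /* one access per page; HW prefetchers do not */
                                  /* prefetch across 4 KiB page boundaries      */

/* With these numbers each sweep touches WSET/STRIDE = 256K distinct lines,
 * i.e. ~16 MiB of cache footprint - enlarge WSET if your LLC is bigger.   */
static uint64_t sweep(volatile uint8_t *buf)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < WSET; i += STRIDE)
        sum += buf[i];
    return sum;                   /* returned so the loop is not optimized away */
}

int main(void)
{
    uint8_t *buf = malloc(WSET);
    if (!buf) return 1;
    for (size_t i = 0; i < WSET; i += STRIDE)
        buf[i] = (uint8_t)i;      /* fault the pages in before measuring */

    volatile uint64_t keep = 0;
    for (int pass = 0; pass < 4; pass++)
        keep += sweep(buf);       /* wrap the later passes in your timer */

    (void)keep;
    free(buf);
    return 0;
}
```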

Leeor
  • Note that the `clflush` command is relatively recent. I believe it's only available in servers. – Nathan Fellman Mar 01 '15 at 10:47
  • Thanks! Since ideally I will try to avoid modification of applications' code, 2 and 3 seem more helpful. I'll try them! – Michael Tong Mar 02 '15 at 19:59
  • Here's the list of [non-temporal movs in Intel's intrinsics guide](https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=movnt) – raphinesse Jan 14 '21 at 09:31
  • NT stores bypass cache, NT load (`movntdqa`) doesn't unless you use it on WC memory. Current CPUs still ignore the NT hint on normal (WB) memory regions. – Peter Cordes Sep 23 '22 at 09:22

Leeor pretty much listed the most "pro" solutions for your task. I'll try to add another proposal that can achieve the same results and can be written in plain C with simple code. The idea is to write a kernel similar to the "Global Random Access" one found in the HPC Challenge (HPCC) benchmark suite.

The idea of the kernel is to jump randomly through a huge array of 8-byte values that is generally half the size of your physical memory (so if you have 16 GB of RAM, you need an 8 GB array, i.e. 1G elements of 8 bytes each). For each jump you can read, write, or read-modify-write the target location.

This most likely measures RAM latency, because jumping randomly through RAM makes caching very inefficient. You will get extremely low cache hit rates, and if you perform enough operations on the array, you will be able to measure the actual performance of memory. This method also makes prefetching very ineffective, as there is no detectable pattern.

You need to take the following into consideration:

  1. Make sure that the compiler does not optimize away your kernel loop (do something with that array or with the values you read from it).
  2. Use a very simple random number generator and do not store the target addresses in another array (that array would itself be cached). I used a linear congruential generator. This way the next address is calculated very quickly and adds no extra latency beyond that of the RAM itself. (A sketch along these lines follows this list.)
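As a rough illustration, here is a hedged sketch of such a kernel in plain C, assuming a 1 GiB array (scale it toward half your RAM as described) and Knuth's MMIX constants for the LCG; the final printf only exists so the loop cannot be optimized away:

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define LOG2_ELEMS 27                          /* 2^27 elements * 8 B = 1 GiB */
#define ELEMS      (1ull << LOG2_ELEMS)
#define UPDATES    (1ull << 24)                /* number of random accesses   */

int main(void)
{
    uint64_t *a = calloc(ELEMS, sizeof *a);
    if (!a) return 1;

    /* 64-bit LCG (Knuth's MMIX constants): the next index is computed entirely
     * in registers, so generating addresses adds no extra memory traffic.     */
    uint64_t x = 0x9E3779B97F4A7C15ull;        /* arbitrary non-zero seed */
    for (uint64_t i = 0; i < UPDATES; i++) {
        x = x * 6364136223846793005ull + 1442695040888963407ull;
        uint64_t idx = x >> (64 - LOG2_ELEMS); /* high bits give a better index */
        a[idx] ^= x;                           /* read-modify-write of one 8 B element */
    }

    /* Use the array so the compiler cannot discard the whole loop. */
    printf("checksum: %llu\n", (unsigned long long)a[x & (ELEMS - 1)]);
    free(a);
    return 0;
}
```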
VAndrei
  • Thank you, but I'm trying to measure the speed in the background and affect the performance of applications as little as possible, so taking that much memory is not ideal in my case. However, it is a good idea for benchmarks; I could use this to evaluate an implementation. – Michael Tong Mar 02 '15 at 20:03
  • 3
    Keep in mind that most modern "big" CPUs (the kind you'll generally find in anything from a smart phone or larger), will allow multiple outstanding requests to memory. So if you randomly access a large array, using something like a LCG as suggested, you won't measure true memory access latency, since the CPU will "queue up" N accesses, which will mostly execute in parallel. On recent Intel CPUs, N is something like 10 (google "line buffers"), so you could measure a value as little as 1/10th the true latency. To measure true latency, make sure each memory access depends on the access before it. – BeeOnRope Jun 21 '16 at 23:04
  • 2
    A simple way of doing that is to make your array all zeros and simply add the result of the last lookup to the result of your LCG. Since it is always zero, it doesn't affect the result, but it will force the CPU to resolve each memory access before proceeding to the next. You can get fancier too, by pre-populating the array with random values and using that as your random function. That removes the overhead of the LCG from your timing loop. – BeeOnRope Jun 21 '16 at 23:07