
My goal is to load a static structure into the L1D cache, perform some operations on its members, and, once the work is done, run invd to discard all of the modified cache lines. Basically, I want to create a secure environment inside the cache so that, while I'm operating on the data inside the cache, nothing is leaked to RAM.

To do this, I have a kernel module in which I place some fixed values into the members of a structure. Then I disable preemption, disable the caches on all other CPUs (every CPU except the current one), disable interrupts, and use __builtin_prefetch() to load my static structure into the cache. After that, I overwrite the previously placed fixed values with new values, execute invd (to discard the modified cache lines), and then re-enable the caches on the other CPUs, re-enable interrupts, and re-enable preemption. My reasoning is that, since I'm doing all of this in atomic mode, INVD should throw away the changes, and after leaving atomic mode I should see the original fixed values that I placed earlier. That is not what happens: after coming out of atomic mode, I see the new values that I used to overwrite the fixed ones. Here is my module code.

It's strange that my output changes after rebooting the PC, and I just don't understand why: now I'm not seeing any changes at all. I'm posting the full code, including some fixes @Peter Cordes suggested:

#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/init.h>
#include <linux/moduleparam.h>
#include <linux/smp.h>      /* smp_processor_id(), smp_call_function() */
#include <linux/string.h>   /* memcpy(), strcpy() */

MODULE_LICENSE("GPL");
MODULE_AUTHOR("Author");
MODULE_DESCRIPTION("test INVD");

static struct CACHE_ENV{
    unsigned char in[128];
    unsigned char out[128];
}cacheEnv __attribute__((aligned(64)));

#define cacheEnvSize (sizeof(cacheEnv)/64)
//#define change "Hello"
unsigned char change[]="hello";


void disCache(void *p){
    /* wbinvd, set CR0.CD (bit 30) to put this CPU into no-fill mode, then wbinvd again */
    __asm__ __volatile__ (
        "wbinvd\n"
        "mov %%cr0, %%rax\n\t"
        "or $(1<<30), %%eax\n\t"
        "mov %%rax, %%cr0\n\t"
        "wbinvd\n"
        ::
        :"%rax"
    );

    printk(KERN_INFO "cpuid %d --> cache disable\n", smp_processor_id());

}


void enaCache(void *p){
    /* clear CR0.CD (bit 30): take this CPU out of no-fill mode */
    __asm__ __volatile__ (
        "mov %%cr0, %%rax\n\t"
        "and $~(1<<30), %%eax\n\t"
        "mov %%rax, %%cr0\n\t"
        ::
        :"%rax"
    );

    printk(KERN_INFO "cpuid %d --> cache enable\n", smp_processor_id());

}

int changeFixedValue (struct CACHE_ENV *env){
    int ret=1;
    //memcpy(env->in, change, sizeof (change));
    //memcpy(env->out, change,sizeof (change));

    strcpy(env->in,change);
    strcpy(env->out,change);
    return ret;
}

void fillCache(unsigned char *p, int num){
    int i;
    //unsigned char *buf = p;
    volatile unsigned char *buf=p;

    for(i=0;i<num;++i){
    
/*
        asm volatile(
        "movq $0,(%0)\n"
        :
        :"r"(buf)
        :
        );
*/
        //__builtin_prefetch(buf,1,1);
        //__builtin_prefetch(buf,0,3);
        *buf += 0;   /* RMW through a volatile pointer: a real store, so the line ends up in L1d in Modified state */
        buf += 64;   /* step to the next cache line */
     }
    printk(KERN_INFO "Inside fillCache, num is %d\n", num);
}

static int __init device_init(void){
    unsigned long flags;
    int result;

    struct CACHE_ENV env;

    //setup Fixed values
    char word[] ="0xabcd";
    memcpy(env.in, word, sizeof(word) );
    memcpy(env.out, word, sizeof (word));
    printk(KERN_INFO "env.in fixed is %s\n", env.in);
    printk(KERN_INFO "env.out fixed is %s\n", env.out);

    printk(KERN_INFO "Current CPU %s\n", smp_processor_id());

    // start atomic
    preempt_disable();
    smp_call_function(disCache,NULL,1);
    local_irq_save(flags);

    asm("lfence; mfence" ::: "memory");
    fillCache((unsigned char *)&env, cacheEnvSize);
    
    result=changeFixedValue(&env);

    //asm volatile("invd\n":::);
    asm volatile("invd\n":::"memory");

    // exit atomic
    smp_call_function(enaCache,NULL,1);
    local_irq_restore(flags);
    preempt_enable();

    printk(KERN_INFO "After: env.in is %s\n", env.in);
    printk(KERN_INFO "After: env.out is %s\n", env.out);

    return 0;
}

static void __exit device_cleanup(void){
    printk(KERN_ALERT "Removing invd_driver.\n");
}

module_init(device_init);
module_exit(device_cleanup);

And I'm getting the following output:

[ 3306.345292] env.in fixed is 0xabcd
[ 3306.345321] env.out fixed is 0xabcd
[ 3306.345322] Current CPU (null)
[ 3306.346390] cpuid 1 --> cache disable
[ 3306.346611] cpuid 3 --> cache disable
[ 3306.346844] cpuid 2 --> cache disable
[ 3306.347065] cpuid 0 --> cache disable
[ 3306.347313] cpuid 4 --> cache disable
[ 3306.347522] cpuid 5 --> cache disable
[ 3306.347755] cpuid 6 --> cache disable
[ 3306.351235] Inside fillCache, num is 4
[ 3306.352250] cpuid 3 --> cache enable
[ 3306.352997] cpuid 5 --> cache enable
[ 3306.353197] cpuid 4 --> cache enable
[ 3306.353220] cpuid 6 --> cache enable
[ 3306.353221] cpuid 2 --> cache enable
[ 3306.353221] cpuid 1 --> cache enable
[ 3306.353541] cpuid 0 --> cache enable
[ 3306.353608] After: env.in is hello
[ 3306.353609] After: env.out is hello

My Makefile is

obj-m += invdMod.o
CFLAGS_invdMod.o := -O0
invdMod-objs := disable_cache.o  

all:
    make -C /lib/modules/$(shell uname -r)/build M=$(PWD) modules

clean:
    make -C /lib/modules/$(shell uname -r)/build M=$(PWD) clean
    rm -f *.o

Any thoughts on what I'm doing incorrectly? As I said before, I expect the output to still show the original fixed values.

One reason I can think of is that __builtin_prefetch() is not actually putting the structure into the cache. Another way to get something into the cache is to set up a write-back region with the help of MTRR & PAT, but I'm fairly clueless about how to achieve that. I found 12.6. Creating MTRRs from a C program using ioctl()'s, which shows how to create an MTRR region, but I can't figure out how to bind the address of my structure to that region.
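
For reference, here is roughly what the /proc/mtrr ioctl interface from that document looks like from user space. This is only a sketch based on the doc; the base address below is a placeholder, and finding the physical address that corresponds to my structure is exactly the part I don't know how to do:

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <asm/mtrr.h>

int main(void)
{
    struct mtrr_sentry s = {
        .base = 0x12340000,       /* physical base address (placeholder) */
        .size = 0x1000,           /* region size in bytes                */
        .type = MTRR_TYPE_WRBACK  /* write-back caching                  */
    };
    int fd = open("/proc/mtrr", O_WRONLY);

    if (fd < 0) {
        perror("open /proc/mtrr");
        return 1;
    }
    if (ioctl(fd, MTRRIOC_ADD_ENTRY, &s) < 0)
        perror("MTRRIOC_ADD_ENTRY");
    close(fd);
    return 0;
}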

My CPU is : Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz

Kernel version : Linux xxx 4.4.0-200-generic #232-Ubuntu SMP Wed Jan 13 10:18:39 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

GCC version : gcc (Ubuntu 5.4.0-6ubuntu1~16.04.12) 5.4.0 20160609

I compiled this module with the -O0 option.

Update 2: Hyperthreading off

I turned off hyperthreading with echo off > /sys/devices/system/cpu/smt/control. After that, when I run my module, it looks as if changeFixedValue() & fillCache() are not getting called at all.

output:

[ 3971.480133] env.in fixed is 0xabcd
[ 3971.480134] env.out fixed is 0xabcd
[ 3971.480135] Current CPU 3
[ 3971.480739] cpuid 2 --> cache disable
[ 3971.480956] cpuid 1 --> cache disable
[ 3971.481175] cpuid 0 --> cache disable
[ 3971.482771] cpuid 2 --> cache enable
[ 3971.482774] cpuid 0 --> cache enable
[ 3971.483043] cpuid 1 --> cache enable
[ 3971.483065] After: env.in is 0xabcd
[ 3971.483066] After: env.out is 0xabcd
  • @PeterCordes, after changing the asm to `asm volatile("invd\n":::"memory");`, I'm still getting the same output. – user45698746 Mar 24 '21 at 03:29
  • Also, PREFETCHT0 (into L1d cache) is `__builtin_prefetch(p,0,3);`. ([What is the effect of second argument in __builtin_prefetch()?](https://stackoverflow.com/q/40513280) shows how it maps to instructions; you're using `prefetchw` depending on compiler options). But really, since you need this for correctness, you shouldn't be using optional hints that the HW can drop if it's busy. Use a volatile read like `READ_ONCE` to get GCC to emit a load instruction. Or use `volatile char *buf` and use `*buf += 0;` to RMW, to make sure the line is in exclusive state. – Peter Cordes Mar 24 '21 at 03:30
  • @PeterCordes, I changed `unsigned char *buf = p;` to `volatile unsigned char *buf = p;`, changed the older prefetching call to `__builtin_prefetch(p,0,3)`, and changed `buf += 64` to `*buf += 0;`. Still the same; nothing changes. – user45698746 Mar 24 '21 at 03:48
  • @PeterCordes, actually, in the `disCache()` function I'm putting all cores except the current one into no-fill mode. – user45698746 Mar 24 '21 at 03:49
  • You said you changed `buf += 64` to `*buf += 0;` - I think you haven't understood what those things do. I'm suggesting `*buf += 0` or `*buf |= 0;` as a replacement for `__builtin_prefetch`, to get `addb $0, (%rdi)` instead of `prefetcht0 (%rdi)`. Of course you still need to increment the pointer by 64 with `buf += 64;` – Peter Cordes Mar 24 '21 at 04:35
  • You can look at the compiler-generated asm to make sure the flush code is sane. – Peter Cordes Mar 24 '21 at 04:43
  • Provide a minimal, reproducible example. Specify the CPU model, kernel version, and compiler version and parameters. Does it make a difference if you use normal assignment instead of `memcpy()` in `changeFixedValue()`? – Hadi Brais Mar 24 '21 at 12:26
  • @HadiBrais, I have posted the full source code. – user45698746 Mar 24 '21 at 17:34
  • @PeterCordes, yea, I did misunderstand that part. Something strange happens after rebooting my machine. My output changes, please see the full output on the updated post. – user45698746 Mar 24 '21 at 17:36
  • IDK, seems strange. I think this should work, and the accesses to stack memory from debug-mode compiler output shouldn't be causing conflict misses that evict `env` between the strcpy calls and the `invd`. There aren't any printk calls or other expensive functions in that window. – Peter Cordes Mar 24 '21 at 17:49
  • @PeterCordes, just wondering, does hyperthreading have anything to do with it? I tried to run with hyperthreading off, and my system just stalls. – user45698746 Mar 24 '21 at 17:54
  • You mean it stalls when you try this experiment, or it can't even boot? I don't know why either of those things would happen; you don't seem to be hard-coding any CPU core numbers that would be wrong or invalid with fewer logical cores in the system. Oh, perhaps the CR0 setting in `enaCache` is per-physical-core, not per-logical-core? If that was true, though, you'd expect disabling HT to make it work correctly, since you need cache enabled on the core doing the stores you want to discard. – Peter Cordes Mar 24 '21 at 18:07
  • @PeterCordes No, turning off hyperthreading was OK; my system keeps stalling when I try to insert the module. After a reboot, it is working. `perhaps the CR0 setting in enaCache is per-physical-core, not per-logical-core?` I don't think I understand this line. To my understanding, when I turn off `hyperthreading`, all I have are the physical cores; therefore, when using `smp_call_function();` to call `enaCache`, I'm indeed enabling the cache on all of the disabled cores. – user45698746 Mar 24 '21 at 18:52
  • I was wondering what that CR0 setting was doing when your system did have HT enabled, whether perhaps cache was being disabled on the current core by setting CR0 on the sibling logical core. i.e. if there's something wrong with this, why it works with HT enabled. – Peter Cordes Mar 24 '21 at 18:58
  • *After that, when I run my module, it looks as if changeFixedValue() & fillCache() are not getting called* - or `invd` is correctly discarding stores done by printk into the kernel log buffer! I hadn't previously noticed that you had a printk inside that cache-disabled region; that could easily cause a crash if some stores get written back because of limited cache capacity, but others get discarded. – Peter Cordes Mar 24 '21 at 19:32
  • @PeterCordes, yep, that was the case. After commenting out `asm volatile("invd\n":::"memory");`, I can see that `changeFixedValue()` & `fillCache()` are getting called and printing the modified values. And after un-commenting `asm volatile("invd\n":::"memory");`, I can see that `invd` is indeed clearing the cache and restoring the original fixed values. So I guess I can now say that my structure was loaded into the `L1D cache` and `invd` did clear that L1D cache. – user45698746 Mar 24 '21 at 19:58
  • @PeterCordes, since your answer already has lots of important details needed to make this work, and you have more knowledge of this, could you please add those details to your answer? I would like to mark it as accepted. – user45698746 Mar 24 '21 at 20:22
  • I already did, see the new first section at the top. – Peter Cordes Mar 24 '21 at 20:24
  • See @Hadi's comment on my answer about CR0.CD being per-physical-core with Hyperthreading, so there is a real difference in terms of no-fill mode or not. – Peter Cordes Mar 25 '21 at 15:31

1 Answer


It looks very unsafe to call printk at the bottom of fillCache. You're about to run a few more stores then an invd, so any modifications printk makes to kernel data structures (like the log buffer) might get written back to DRAM or might get invalidated if they're still dirty in cache. If some but not all stores make it to DRAM (because of limited cache capacity), you could leave kernel data structures in an inconsistent state.

I'd guess that your current tests with HT disabled show everything working even better than you hoped, including discarding stores done by printk, as well as discarding the stores done by changeFixedValue. That would explain the lack of log messages left for user-space to read once your code finishes.

To test this, you'd ideally want to clflush everything printk did, but there's no easy way to do that. Perhaps wbinvd then changeFixedValue then invd. (You're not entering no-fill mode on this core, so fillCache isn't necessary for your store / invd idea to work, see below.)
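
A rough sketch of that ordering, in the same inline-asm style as the question's code:

    asm volatile("wbinvd" ::: "memory");   /* write back printk's dirty lines first        */
    changeFixedValue(&env);                /* now re-dirty only the lines under test       */
    asm volatile("invd" ::: "memory");     /* discard: only the test stores should be lost */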


With Hyperthreading enabled:

CR0.CD is per-physical-core, so having your HT sibling core disable cache also means CD=1 for the isolated core. So with HT enabled, you were in no-fill mode even on the isolated core.

With HT disabled, the isolated core is still normal.


Compile-time and run-time reordering

asm volatile("invd\n":::); without a "memory" clobber tells the compiler it's allowed to reorder it wrt. memory operations. Apparently that isn't the problem in your case, but it's a bug you should fix.

Probably also a good idea to put asm("mfence; lfence" ::: "memory"); right before fillCache, to make sure any cache-miss loads and stores aren't still in flight and maybe allocating new cache lines while your code is running. Or possibly even a fully serializing instruction like asm("xor %%eax,%%eax; cpuid" ::: "eax", "ebx", "ecx", "edx", "memory");, but I don't know of anything that CPUID blocks which mfence; lfence wouldn't.
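
Spelled out in the question's style, those statements would look something like this (sketch only):

    asm volatile("invd" ::: "memory");   /* "memory" clobber = compiler barrier around the discard */

    asm("mfence; lfence" ::: "memory");  /* drain the store buffer, then keep later memory ops from starting early */

    /* or a fully serializing instruction: */
    asm volatile("xor %%eax, %%eax\n\t"
                 "cpuid"
                 ::: "eax", "ebx", "ecx", "edx", "memory");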


The title question: touching memory to bring it into cache

PREFETCHT0 (into L1d cache) is __builtin_prefetch(p,0,3);. This answer shows how the arguments map to instructions; you're using prefetchw (write-intent) or I think prefetcht1 (L2 cache) depending on compiler options.

But really since you need this for correctness, you shouldn't be using optional hints that the HW can drop if it's busy. mfence; lfence would make it unlikely for the HW to actually be busy, but still not a bad idea.

Use a volatile read like READ_ONCE to get GCC to emit a load instruction. Or use volatile char *buf with *buf |= 0; or something to truly RMW instead of prefetch, to make sure the line is exclusively owned without having to get GCC to emit prefetchw.
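
For example, a fillCache-style loop built on accesses the compiler must actually emit could look like this (just a sketch; touch_lines / own_lines are my names, READ_ONCE is the kernel macro from <linux/compiler.h>):

    /* Read-only version: a guaranteed load per cache line (Shared state is enough for reads). */
    static void touch_lines(const unsigned char *p, size_t bytes)
    {
        size_t off;
        for (off = 0; off < bytes; off += 64)
            (void)READ_ONCE(p[off]);
    }

    /* RMW version: a real store per line, so each line ends up exclusively owned (Modified). */
    static void own_lines(unsigned char *p, size_t bytes)
    {
        volatile unsigned char *buf = p;
        size_t off;
        for (off = 0; off < bytes; off += 64)
            buf[off] |= 0;
    }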

Perhaps worth running fillCache a couple times, just to make more sure that every line is properly in the state you want. But since your env is smaller than 4k, each line will be in a different set in L1d cache, so there's no risk that one line got tossed out while allocating another (except in case of an alias in L3 cache's hash function? But even then, pseudo-LRU eviction should keep the most-recent line reliably.)


Align your data by 128, an aligned-pair of cache lines

static struct CACHE_ENV { ... } cacheEnv; isn't guaranteed to be aligned by the cache line size; you're missing C11 _Alignas(64) or GNU C __attribute__((aligned(64))). So it might be spanning more than sizeof(T)/64 lines. Or for good measure, align by 128 for the L2 adjacent-line prefetcher. (Here you can and should simply align your buffer, but The right way to use function _mm_clflush to flush a large struct shows how to loop over every cache line of an arbitrary-sized possibly-unaligned struct.)
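
In GNU C that's just (sketch):

    static struct CACHE_ENV {
        unsigned char in[128];
        unsigned char out[128];
    } cacheEnv __attribute__((aligned(128)));   /* 128 = an aligned pair of 64-byte lines */
    /* or, C11: _Alignas(128) static struct CACHE_ENV cacheEnv; */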

This doesn't explain your problem, since the only part that might get missed is the last up-to-48 bytes of env.out. (I think the global struct will get aligned by 16 by default ABI rules.) And you're only printing the first few bytes of each array.


An easier way: memset(0) to avoid leaking data back to DRAM

And BTW, overwriting your buffer with 0 via memset after you're done should also keep your data from getting written back to DRAM about as reliably as INVD, but faster. (Maybe a manual rep stosb via asm to make sure it can't optimize away as a dead store).
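
A minimal sketch of that rep stosb idea (the helper name is mine). The "+D"/"+c" constraints tell GCC that RDI and RCX are consumed, and the "memory" clobber tells it memory is written, so the wipe can't be deleted as a dead store:

    static void secure_wipe(void *p, size_t n)
    {
        asm volatile("rep stosb"
                     : "+D"(p), "+c"(n)   /* RDI = destination, RCX = byte count */
                     : "a"(0)             /* AL = 0, the fill byte               */
                     : "memory");
    }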

No-fill mode might also be useful here to stop cache misses from evicting existing lines. AFAIK, that basically locks down the cache so no new allocations will happen, and thus no evictions. (But you might not be able to read or write other normal memory, although you could leave a result in registers.)

No-fill mode (for the current core) would make it definitely safe to clear your buffers with memset before re-enabling allocation; no risk of a cache miss during that causing an eviction. Although if your fillCache actually works properly and gets all your lines into MESI Modified state before you do your work, your loads and stores will hit in L1d cache without risk of evicting any of your buffer lines.

If you're worried about DRAM contents (rather than bus signals), then clflushopt each line after memset will reduce the window of vulnerability. (Or memcpy from a clean copy of the original if 0 doesn't work for you, but hopefully you can just work in a private copy and leave the orig unmodified. A stray write-back is always possible with your current method so I wouldn't want to rely on it to definitely always leave a large buffer unmodified.)
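
A sketch of flushing each line after the wipe (the helper name is mine; clflushopt needs a CPU that has it, which Kaby Lake does, and the kernel's clflush_cache_range() does roughly this loop for you):

    static void flush_lines(void *p, size_t bytes)
    {
        unsigned long addr = (unsigned long)p & ~63UL;   /* round down to a line boundary */
        unsigned long end  = (unsigned long)p + bytes;

        for (; addr < end; addr += 64)
            asm volatile("clflushopt (%0)" :: "r"(addr) : "memory");
        asm volatile("sfence" ::: "memory");             /* order the flushes before later stores */
    }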

Don't use NT stores for a manual memset or memcpy: that might flush the "secret" dirty data before the NT store. One option would be to memset(0) with normal stores or rep stosb, then loop again with NT stores. Or perhaps doing 8x movq normal stores per line, then 8x movnti, so you do both things to the same line back to back before moving on.
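
A sketch of the two-pass option (the helper name is mine): normal stores first so the lines are dirty zeros in cache, then movnti over the same range so whatever eventually reaches DRAM is also zeros:

    static void wipe_then_nt(void *p, size_t bytes)
    {
        volatile unsigned char *b = p;
        unsigned long *q = p;
        size_t i;

        for (i = 0; i < bytes; i++)        /* normal (volatile) stores: can't be optimized away */
            b[i] = 0;

        for (i = 0; i < bytes / 8; i++)    /* non-temporal 64-bit stores over the same range */
            asm volatile("movnti %1, %0" : "=m"(q[i]) : "r"(0UL));
    }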


Why fillCache at all?

If you're not using no-fill mode, it shouldn't even matter whether the lines are cached before you write to them. You just need your writes to be dirty in cache when invd runs, which should be true even if they got that way from your stores missing in cache.

You already don't have any barrier like mfence between fillCache and changeFixedValue, which is fine but means that any cache misses from priming the cache are still in flight when you dirty it.

INVD itself is serializing, so it should wait for stores to leave the store buffer before discarding cache contents. (So putting mfence;lfence after your work, before INVD, shouldn't make any difference.) In other words, INVD should discard cacheable stores that are still in the store buffer, as well as dirty cache lines, unless committing some of those stores happens to evict anything.

Peter Cordes
  • (Some of this is what I posted in comments, and it has already been applied in the question. It was enough stuff that it belonged in an answer, though.) – Peter Cordes Mar 24 '21 at 04:33
  • CR0.CD is effectively per physical core. The effective CD value is the OR of the CD bits of each of the sibling logical cores. With HT enabled, setting CD to one disables cache fills in the entire physical core, and so the secret cache lines are not filled. Although I don't think this approach guarantees that the secret data is not written to memory, because there is no architectural guarantee that writebacks, replacements, or invalidations due to snoops don't occur. (Writebacks may occur automatically to reduce the probability of uncorrectable errors, perhaps even in no-fill mode.) – Hadi Brais Mar 25 '21 at 14:01