
Does clflush¹ also flush the associated TLB entry? I would assume not, since clflush operates at cache-line granularity while TLB entries exist at the (much larger) page granularity - but I am prepared to be surprised.


¹ ... or clflushopt, although one would reasonably assume their behaviors are the same.

BeeOnRope
  • On HSW/BDW, based on the experiments I've done in the past, `clflush` does not flush TLB entries. Are you interested in specific microarchitectures? The answer is probably "No" on all of them. – Hadi Brais Jan 15 '19 at 17:50
  • @HadiBrais - I am interested in all recent Intel and AMD architectures, sort of roughly in proportion to their existence in the wild. – BeeOnRope Jan 15 '19 at 17:52

2 Answers


I think it's safe to assume no; baking invlpg into clflush sounds like an insane design decision that I don't think anyone would make. You often want to invalidate multiple lines in a page. There's also no apparent benefit; flushing the TLB as well doesn't make it any easier to implement data-cache flushing.

Even just dropping the final TLB entry (without necessarily invalidating any page-directory caching) would be weaker than invlpg but still not make sense.

All modern x86s use caches with physical indexing/tagging, not virtual. (VIPT L1d caches are really PIPT with free translation of the index, because the index is taken from address bits that are part of the offset within a page: e.g. a 32 KiB 8-way L1d with 64-byte lines has 64 sets, so the index is bits 6–11, all within the 12-bit page offset.) And even if caches were virtual, invalidating TLB entries would require invalidating virtual caches, but not the other way around.


According to IACA, clflush is only 2 uops on HSW-SKL, and 4 uops (including micro-fusion) on NHM-IVB. So it's not even micro-coded on Intel.

IACA doesn't model invlpg, but I assume it's more uops. (And it's privileged so it's not totally trivial to test.) It's remotely possible those extra uops on pre-HSW were for TLB invalidation.

I don't have any info on AMD.


The fact that invlpg is privileged is another reason to expect clflush not to be a superset of it. clflush is unprivileged. Presumably it's only for performance reasons that invlpg is restricted to ring 0.

But invlpg won't page-fault, so user-space could use it to invalidate kernel TLB entries, delaying real-time processes and interrupt handlers. (wbinvd is privileged for similar reasons: it's very slow and I think not interruptible.) clflush does fault on illegal addresses so it wouldn't open up that denial-of-service vulnerability. You could clflush the shared VDSO page, though.

Unless there's some reason why a CPU would want to expose invlpg in user-space (by baking it into clflush), I really don't see why any vendor would do it.


With non-volatile DIMMs in the future of computing, it's even less likely that any future CPUs will make it super-slow to loop over a range of memory doing clflush. You'd expect most software using memory mapped NV storage to be using clflushopt, but I'd expect CPU vendors to make clflush as fast as possible, too.
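For reference, flushing a range from user space is just a loop of clflush/clflushopt over each cache line; here is a minimal sketch using the compiler intrinsics (the 64-byte line size, the function name, and the buffer size are my own illustrative choices):

```c
#include <immintrin.h>   /* _mm_clflush, _mm_clflushopt, _mm_sfence */
#include <stdint.h>
#include <stdlib.h>

#define LINE_SIZE 64     /* assumed cache-line size; query CPUID to be exact */

/* Write back and invalidate every cache line covering [p, p+len). */
static void flush_range(const void *p, size_t len)
{
    uintptr_t addr = (uintptr_t)p & ~(uintptr_t)(LINE_SIZE - 1);
    uintptr_t end  = (uintptr_t)p + len;
    for (; addr < end; addr += LINE_SIZE)
        _mm_clflushopt((void *)addr);  /* or _mm_clflush() without CLFLUSHOPT */
    _mm_sfence();                      /* clflushopt is weakly ordered; fence it */
}

int main(void)
{
    char *buf = malloc(2 * 4096);
    flush_range(buf, 2 * 4096);        /* flushes the lines, but (per the above) not the TLB entries */
    free(buf);
    return 0;
}
```

Build with something like gcc -O2 -mclflushopt; neither instruction requires privilege, which is the point above.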

Peter Cordes
  • Well, one reason you'd possibly want `clflush` to drop the TLB entry is if you were using it for performance- or benchmark-related cache control ("I don't want this area cached anymore") rather than for any type of durability guarantee with NVRAM or similar. In that case, if you flush an entire page, you might want to also drop the associated TLB entries, since the point is to free up resources associated with that memory. I 100% agree though that this is difficult, of dubious value, and unlikely to be implemented in that way. – BeeOnRope Jan 16 '19 at 17:31
  • 1
    It does mean that using `clflush` as part of a benchmark to test "out of cache" behavior isn't very realistic: it tends to leave the system in a state that it would be unlikely to get into naturally: all flushed regions not present in any level of the cache, but the various TLB levels all still hot. In the real world, memory that tends to have aged out of the cache would see a similar lack of TLB entries (as a first level approximation: yes it's complicated by different granularities). So flushing memory by a lot of dummy writes is probably more realistic (but slow). – BeeOnRope Jan 16 '19 at 17:33

The dTLB-loads-misses:u performance event can be used to determine whether clflush flushes the TLB entry that maps the specified cache line. This event occurs when a load misses in all TLB levels and causes a page walk. It's also more widely supported than dTLB-stores-misses:u. In particular, dTLB-loads-misses:u is supported on the Intel P4 and later (except Goldmont) and on AMD K7 and later.

You can find the code at https://godbolt.org/z/97XkkF. It takes two parameters:

  • argv[1], which specifies whether all lines of the specified 4KB page should be flushed or only a single cache line.
  • argv[2], which specifies whether to use clflush or clflushopt.

The test is simple. It allocates a single 4KB page and accesses the same location a large number of times using a load instruction. Before every access, however, a cache flushing operation is performed as specified by argv[1] and argv[2]. If the flush caused the TLB entry to be evicted, then a dTLB-loads-misses:u event will occur. If the number of events is anywhere close to the number of loads, then we may suspect that the flush had an impact on the TLB.
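I haven't reproduced the linked code here, but the core of the test looks something like the following sketch (the identifiers, iteration count, and argument handling are my own illustrative choices, not necessarily what the godbolt code does):

```c
#include <immintrin.h>
#include <stdlib.h>
#include <string.h>

#define PAGE_SIZE 4096
#define LINE_SIZE 64
#define ITERS     1000000L

int main(int argc, char **argv)
{
    int wholePage = argc > 1 && atoi(argv[1]);   /* flush all 64 lines or just one */
    int useOpt    = argc > 2 && atoi(argv[2]);   /* clflushopt instead of clflush */

    char *buf = aligned_alloc(PAGE_SIZE, PAGE_SIZE);
    memset(buf, 1, PAGE_SIZE);
    volatile char *page = buf;       /* volatile so every access is a real load */
    long sum = 0;

    for (long i = 0; i < ITERS; i++) {
        int nlines = wholePage ? PAGE_SIZE / LINE_SIZE : 1;
        for (int l = 0; l < nlines; l++) {       /* flush before every access */
            if (useOpt) _mm_clflushopt(buf + l * LINE_SIZE);
            else        _mm_clflush(buf + l * LINE_SIZE);
        }
        _mm_mfence();
        sum += page[0];              /* the load counted by dTLB-loads-misses:u */
    }
    free(buf);
    return (int)(sum & 1);           /* use sum so it isn't flagged as unused */
}
```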

Use the following commands to compile and run the code:

gcc -mclflushopt -O3 main.c
perf stat -e dTLB-loads-misses:u ./a.out wholePage opt

where wholePage and opt can be 0 or 1. So there are 4 cases to test.

I've run the test on SNB, IVB, HSW, BDW, and CFL. On all processors and in all cases, the number of events is very negligible. You can run the test on other processors.


I've also managed to run a test for WBINVD by calling, inside the loop, an ioctl to a kernel module that executes the instruction in kernel mode. I've measured dTLB-loads-misses:u, iTLB-loads-misses:u, and icache_64b.iftag_miss:u. All of them are negligible (under 0.004% of 1 million load instructions). This means that WBINVD does not flush the DTLB, ITLB, or the instruction cache. It only flushes the data caches.
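One way to run WBINVD from a user-space loop (not necessarily how it was done for the numbers above) is a tiny kernel module that exposes a character device whose ioctl handler just executes the instruction; the device name and the ioctl command number in this sketch are arbitrary:

```c
/* Hypothetical sketch of a module that runs WBINVD on each ioctl.
 * The device name and ioctl command number are arbitrary choices. */
#include <linux/module.h>
#include <linux/miscdevice.h>
#include <linux/fs.h>
#include <asm/special_insns.h>   /* wbinvd() */

static long wbinvd_ioctl(struct file *f, unsigned int cmd, unsigned long arg)
{
    wbinvd();    /* write back and invalidate the data caches of this CPU */
    return 0;
}

static const struct file_operations wbinvd_fops = {
    .owner          = THIS_MODULE,
    .unlocked_ioctl = wbinvd_ioctl,
};

static struct miscdevice wbinvd_dev = {
    .minor = MISC_DYNAMIC_MINOR,
    .name  = "wbinvd_test",
    .fops  = &wbinvd_fops,
};

module_misc_device(wbinvd_dev);
MODULE_LICENSE("GPL");
```

User space then opens /dev/wbinvd_test once and calls ioctl(fd, 0, 0) inside the measured loop, with perf stat attached as before.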

Hadi Brais
  • I've seen a mention that on P5 Pentium, "software TLB miss handling was a performance win" (because HW pagewalk bypassed L1d cache). That might have been a brain-fart from Andy Glew, maybe he meant to say "would have been". See [What happens after a L2 TLB miss?](//stackoverflow.com/q/32256250) for the quote. But anyway, if x86 has some way to disable HW pagewalk (and use `wrmsr` or MMIO or something to update TLB entries?? I couldn't find anything), then TLB state is somewhat architectural, and might explain why `invlpg` *needs* to be privileged, and can't be part of `clflush`. – Peter Cordes Mar 12 '19 at 06:36
  • But I don't think SW TLB-miss handling was ever actually a thing on x86. On some ISAs that started with SW TLB handling (like MIPS), there might be ways to turn on HW pagewalk, IDK. – Peter Cordes Mar 12 '19 at 06:37
  • @PeterCordes Yes, I think he meant "would have been." In fact, Section 4.5.4 of the 386 datasheet (https://en.wikichip.org/wiki/intel/80386) clearly mentions that it is the processor that handles TLB misses, so hardware page walking is used on all Intel processors that support paging. Note that `invlpg` was first supported on the 486. I don't think it *needs* to be privileged, but I think it's better to make it that way. Table 2-3 of the Intel manual V3 mentions that `invlpg` is not useful for user apps. Also, if it were not privileged, a malicious program might use it to evict kernel TLB entries – Hadi Brais Mar 12 '19 at 07:13
  • which may impact the performance of the system. An alternative design would perhaps be to check whether the page belongs to the kernel or to user space, but Intel perhaps didn't deem that worth the effort. – Hadi Brais Mar 12 '19 at 07:14
  • I knew x86 has always *had* HW page walking when paging is enabled, but it was plausible that P5 had a CR bit or MSR that could disable it (so a 386 datasheet doesn't prove anything). Andy worked on MIPS (at Imagination) for a while, including when he wrote those comments; maybe he was thinking of that. So yeah, if you've never heard of SW TLB handling on x86 either, then it's probably not a thing. – Peter Cordes Mar 12 '19 at 07:21