10

When processing a stream of data, e.g., requests from a network, it is quite common for some temporary memory to be used. For example, a URL may be split into multiple strings, each one possibly allocating memory from the heap. These objects are often short-lived, and the total amount of memory involved is often relatively small and should fit into a CPU cache.

At the point the memory used for a temporary string is released, its content may very well have lived only within the cache. However, the CPU is unaware of the memory being deallocated: the deallocation is just an update in the memory management system. As a result, the CPU may end up needlessly writing the unused content to actual memory when the cache is reused for other memory - unless the memory release somehow indicates to the CPU that the memory isn't used anymore. Hence, the question becomes:

**Do memory management functions releasing memory somehow indicate that the content of the corresponding memory can be discarded?** Is there even a way to indicate to the CPU that memory is no longer used? (At least for some CPUs: there may, obviously, be differences between architectures.) Since different implementations will likely differ in quality and may or may not do anything fancy, the question really is whether there is any memory management implementation that indicates memory as unused.

I do realize that always using the same arena of memory may be a mitigation strategy to avoid unnecessary writes to actual memory: in that case the same cached memory would be reused. Similarly, it is likely that memory allocation keeps yielding the same memory, also avoiding unnecessary memory transfers. However, I would prefer not to have to rely on either of these techniques being applicable.
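
For concreteness, a minimal bump-arena sketch of that mitigation (illustrative only: `arena_alloc`/`arena_reset` and the fixed 4 KB size are assumptions, not any library's API):

```c
#include <stddef.h>

#define ARENA_SIZE 4096              /* assumed per-request scratch size */

static unsigned char arena[ARENA_SIZE];
static size_t        arena_used;

/* Hand out the next n bytes of the same fixed buffer. */
static void *arena_alloc(size_t n)
{
    if (ARENA_SIZE - arena_used < n)
        return NULL;                 /* scratch space exhausted */
    void *p = arena + arena_used;
    arena_used += n;
    return p;
}

/* "Free" everything at once; the same lines stay cached for the next request. */
static void arena_reset(void)
{
    arena_used = 0;
}
```

Since every request cycles through the same buffer, any dirty lines belong to memory that is about to be rewritten anyway.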

Iwillnotexist Idonotexist
Dietmar Kühl
  • In general, this is implementation-defined. Could you narrow this question down to one programming language (C/C++ is not a programming language) and possibly a particular implementation? Otherwise we quickly reach “too broad” territory. – fuz Dec 02 '15 at 09:08
  • 1
    @FUZxxl: Mr Kühl is well aware of the differences, I can assure you. But the CPU has no idea whether it's executing `free()` or `delete`, so we can safely ignore those differences. – MSalters Dec 02 '15 at 09:46
  • @MSalters But `free` and `delete` are different functions supplied by different libraries by different languages. I'm not going to go back to the discussion about why it's discouraged to ask questions about multiple languages at once, just look it up. – fuz Dec 02 '15 at 09:51
  • 1
    @FUZxxl: So? The question is about the **implementation** of such functions. That is inherently non-portable code anyway, and in fact the question is about the use of CPU cache control instructions by C and C++ memory allocators. – MSalters Dec 02 '15 at 09:57
  • @MSalters If the question is about implementations regardless of language, then it should be [tag:language-agnostic]. And please, no “C and C++” questions. These are discouraged. Ask one about C and one about C++ if you are interested in both. – fuz Dec 02 '15 at 10:00
  • @FUZxxl: I removed both the C and C++ tags as they question really is about neither. It just happens that these are the languages where people to care about performance. Of course, as a result of removing these tags I shall give up hope for obtaining an answer [actually answering what I'm asking about]. – Dietmar Kühl Dec 02 '15 at 10:04
  • 1
    @FUZxxl: The problem with `language-agnostic` is that this question makes no sense for 95% of languages. And we have no tag for "high-performance languages that compile to native applications, with deterministic explicit memory deallocation". – MSalters Dec 02 '15 at 10:05
  • 1
    @FUZxxl Many questions can be answered for both C and C++ languages identically for many topics; while if they can't, it's useful to draw that distinction. An answer is the richer for doing so, and as a result I don't share your aversion to tagging [c] & [c++]. OP's question is _a priori_ especially liable to have a common answer, since it's unlikely that, say, `glibc` would use a cache trick that `libstdc++` doesn't. Incidentally, most languages' runtime/interpreter is based, somehow, atop the C library! I share OP's concerns about lack of exposure too, with 88 views despite the max bounty. – Iwillnotexist Idonotexist Dec 05 '15 at 17:28

5 Answers

9

No.

The cache operation you mention (marking cached memory as unused and discarding without writeback to main memory) is called cacheline invalidation without writeback. This is performed through a special instruction with an operand that may (or may not) indicate the address of the cacheline to be invalidated.

On all architectures I'm familiar with, this instruction is privileged, with good reason in my opinion. This means that usermode code cannot employ the instruction; only the kernel can. The amount of perverted trickery, data loss and denial of service that would otherwise be possible is incredible.

As a result, no memory allocator could do what you propose; they simply don't have (in usermode) the tools to do so.

Architectural Support

  • The x86 and x86-64 architectures have the privileged `invd` instruction, which invalidates all internal caches without writeback and directs external caches to invalidate themselves also. This is the only instruction capable of invalidating without writeback, and it is a blunt weapon indeed.
    • The non-privileged `clflush` instruction specifies a victim address, but it writes back before invalidating, so I mention it only in passing (a usermode sketch of it follows this list).
    • Documentation for all these instructions is in Intel's SDMs, Volume 2.
  • The ARM architecture performs cache invalidation without writeback with a write to coprocessor 15, register 7: `MCR p15, 0, <Rd>, c7, <CRm>, <Opcode_2>`. A victim cacheline may be specified. Writes to this register are privileged.
  • PowerPC has `dcbi`, which lets you specify a victim, `dci`, which doesn't, and instruction-cache versions of both, but all four are privileged (see page 1400).
  • MIPS has the `CACHE` instruction, which can specify a victim. It was privileged as of MIPS Instruction Set v5.04, but in 6.04 Imagination Technologies muddied the waters and it's no longer clear what's privileged and what isn't.
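
For contrast, the strongest cache control available to usermode x86 code is flushing, which always writes dirty data back before invalidating. A minimal sketch, assuming SSE2 and a 64-byte line size (`flush_range` is a made-up helper; real code would query the line size via CPUID):

```c
#include <stddef.h>
#include <stdint.h>
#include <immintrin.h>

#define LINE 64  /* assumption: typical x86-64 cacheline size */

/* Flush (writeback + invalidate) every cacheline covering buf[0..len).
 * Note this cannot discard dirty data; the writeback always happens. */
static void flush_range(const void *buf, size_t len)
{
    uintptr_t p   = (uintptr_t)buf & ~(uintptr_t)(LINE - 1);
    uintptr_t end = (uintptr_t)buf + len;
    for (; p < end; p += LINE)
        _mm_clflush((const void *)p);
    _mm_mfence();  /* clflush is only ordered with respect to fences */
}
```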

So this outright excludes the use of cache invalidation without flushing/writeback in usermode.

Kernel mode?

However, I'd argue that it's still a bad idea in kernelmode for numerous reasons:

  • Linux's allocator, `kmalloc()`, allocates out of arenas for different sizes of allocations. In particular, it has an arena for each allocation size <= 192 bytes, in steps of 8; this means that objects can potentially sit closer to each other than a cacheline, or partially overlap the next one, and using invalidation could thus blow out nearby objects that were rightly in cache and not yet written back. This is wrong.
    • This problem is compounded by the fact that cache lines may be quite large (on x86-64, 64 bytes), and moreover are not necessarily uniform in size throughout the cache hierarchy. For instance, Pentium 4 had 64B L1 cachelines but 128B L2 cachelines.
  • It makes the deallocation time linear in the number of cachelines of the object to deallocate.
  • It has a very limited benefit; the L1 cache is usually only a few KB in size, so a few thousand flushes will fully empty it. Moreover, the cache may already have flushed the data without your prompting, in which case your invalidation is worse than useless: the memory bandwidth was used, but you no longer have the line in cache, so when it is next partially written it will need to be refetched.
  • The next time the memory allocator returns that block, which might be soon, its user will suffer a guaranteed cache miss and fetch from main RAM, whereas they could have had a dirty unflushed line or a clean flushed line instead. The cost of a guaranteed cache miss and fetch from main RAM is much greater than that of a cacheline flush without invalidation, tucked in somewhere automatically and intelligently by the caching hardware.
  • The additional code required to loop and flush these lines wastes instruction cache space.
  • A better use for the dozens of cycles taken by the aforementioned loop to invalidate cachelines would be to keep doing useful work, while letting the cache and memory subsystem's considerable bandwidth write back your dirty cachelines.
    • My modern Haswell processor has 32 bytes/cycle of L1 write bandwidth and 25 GB/s of main RAM bandwidth. I'm sure a couple of extra flushable 32-byte cachelines can be squeezed in somewhere in there.
  • Lastly, for short-lived, small allocations like that, there's the option of allocating them on the stack (see the sketch after this list).
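
A sketch of the OP's URL-splitting scenario done that way (the `handle_request` helper is hypothetical; POSIX `strtok_r` is assumed):

```c
#define _POSIX_C_SOURCE 200809L  /* for strtok_r */
#include <stdio.h>
#include <string.h>

static void handle_request(const char *url)
{
    char tmp[256];  /* short-lived scratch lives on the (cache-hot) stack */

    strncpy(tmp, url, sizeof tmp - 1);
    tmp[sizeof tmp - 1] = '\0';

    char *saveptr = NULL;
    for (char *part = strtok_r(tmp, "/", &saveptr);
         part != NULL;
         part = strtok_r(NULL, "/", &saveptr))
        printf("segment: %s\n", part);
}  /* returning "releases" tmp - no allocator, no writeback question */
```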

Actual memory allocator practice

  • The famed dlmalloc does not invalidate freed memory.
  • glibc does not invalidate freed memory.
  • jemalloc does not invalidate freed memory.
  • musl-libc's malloc() does not invalidate freed memory.

None of them invalidates freed memory, because they can't: doing a system call for the sake of invalidating cachelines would be incredibly slow and would cause far more traffic in and out of the cache, just because of the context switch.

Iwillnotexist Idonotexist
  • 1
    Very good answer! I'm not entirely agreeing with the first argument against kernel-mode usage, though. Kernel code is held to a higher standard of correctness, with good reason. Invalidating a cache line which still holds valid data is one of about a thousand possible bugs, **each** of which would be **wrong**. It's also trivially avoided since each arena has a known allocation size; just don't flush when this size is smaller than the cache line size. – MSalters Dec 08 '15 at 15:56
  • @MSalters Mmm, the thing is that manipulating objects less than a cacheline in size, misaligned objects, or objects partially spanning two cachelines is not uncommon. It's also about to get more frequent; cachelines are already 64 B in L1 on x86-64, and Intel Kabylake will crank this to 256 B. You're not going to be able to invalidate very often at that rate... You've also made me think of another problem: caches are not obliged to all share the same cacheline size. AFAIK the Pentium 4 used [64 B lines in L1 but 128 B in L2](http://www.ccs.neu.edu/course/com3200/parent/NOTES/cache-basics.html). – Iwillnotexist Idonotexist Dec 08 '15 at 18:28
  • 1
    Well x86 doesn't have an instruction to invalidate a single cache line anyway, as `invd` throws out everything. As far as L1/L2 size differences, this might not be a real issue. Discarding an L1 line would eliminate the write to L2. This increases the chance of the L2 line staying "unmodified" which means no main memory write. However, the other half of the L2 line might already have been modified, which means that the whole line is (correctly) written back, with the right half updated. – MSalters Dec 10 '15 at 01:59
  • @MSalters Isn't what you describe not quite an invalidation? An invalidation is intended to boot that address out of the entire cache hierarchy, without WB, causing future accesses to fetch from main RAM. If the left half is invalidated in L1, then it must also disappear from L2; else L1 misses will be serviced out of L2. Writing both halves to memory defeats the purpose of inv without WB, as well as OP's aim. IMHO the only correct solution is to invalidate the right half also and document that as the OS developer's concern, or else have instructions that identify which cache they refer to. – Iwillnotexist Idonotexist Dec 10 '15 at 02:23
  • 1
    It's not an absolute invalidation, but still an optimization. L2 is a lot bigger than L1; discarding an L1 line is therefore more worthwhile than discarding an L2 line. Also, we might have to write back the L2 line for other reasons, but that's far from certain. We still save an L2->memory write if the line is evicted with one half unmodified and the other half discarded from L1. – MSalters Dec 10 '15 at 18:14
2

I am not aware of any architecture that willingly exposes its cache coherence protocol to software (user or even kernel) manipulation like this. This would create caveats that are practically impossible to handle. Note that user-initiated flushing is an acceptable exposure because it in no way threatens to break memory coherency.

Just as an example, imagine you have a cache line with some temporary data you no longer need. Since it was written to, it would be in the "modified" state in the cache. Now you want a mechanism that tells the cache to avoid writing it back, but that means you create a race condition - if someone else were to look for the line before you applied this trick, he would have snooped it out of the core and received the updated data. If your core had its way first, the new data would have been lost - therefore, the outcome of that address in memory depends on a race.

You might argue that in multithreaded programming that is often the case, but this scenario may also occur when running a single thread (the CPU may voluntarily evict a line earlier if the cache is full, or some lower inclusive level may lose it). Worse, this breaks the premise that the entire virtual memory appears flat and that cached versions are maintained by the CPU only for performance and cannot break coherency or consistency (except in some documented multithreaded cases that depend on the memory ordering model, which can be overcome by software protection).
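
A compilable sketch of that race. The `discard_cacheline()` primitive is hypothetical - no such usermode operation exists - so it is stubbed as a no-op purely so the example builds (compile with `-pthread`):

```c
#include <pthread.h>
#include <stdio.h>

static int x;  /* assume x sits alone in one cacheline */

/* Hypothetical: drop the Modified line holding p without writeback. */
static void discard_cacheline(void *p) { (void)p; }

static void *reader(void *arg)
{
    (void)arg;
    /* May snoop 42 out of the writer's cache... or not. */
    printf("reader saw x = %d\n", x);
    return NULL;
}

int main(void)
{
    pthread_t t;
    x = 42;                 /* line becomes Modified in this core's cache */
    pthread_create(&t, NULL, reader, NULL);
    discard_cacheline(&x);  /* if real, the store to x could be silently lost */
    pthread_join(t, NULL);  /* whether the reader saw 42 or 0 would depend */
    return 0;               /* purely on snoop timing: a race */
}
```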

Edit: If you are willing to extend the definition of what you consider "memory", you could look into non-coherent types of memory, which differ in definition and implementation, some of which may provide what you seek. Some architectures expose "scratchpad" memory, which is controlled by the user and allows fast access without the hassle of cache coherency (but also without its benefits). Some architectures even go as far as providing configurable hardware that allows you to select whether you prefer to cache main memory in it or to use it as a scratchpad area.

Leeor
  • +1, exactly right. This is the sort of thing I had in mind when I wrote about _perverted trickery, data loss and denial of service_ in my answer. I amongst other things had in mind two threads sharing a hyperthreading core. – Iwillnotexist Idonotexist Dec 11 '15 at 22:22
1

This is very much a matter of the implementation and the library that you are using. Allocated and freed memory tends to be reallocated very quickly. Most allocations are small blocks, much smaller than a page, which would only be written to backing storage when needed.

And today, RAM sizes are typically so large that when the OS starts writing dirty pages to backing store, you are in trouble no matter what. If you have 16 GB of RAM, you won't be writing a hundred kilobytes or a megabyte, you will be writing gigabytes, and your computer will slow to a crawl. The user will avoid the situation by not using applications that use too much memory.

gnasher729
  • I'm actually more interested in the CPU cache-level than writing pages to backing store. The relation between CPU caches and main memory is, however, indeed similar to that between main memory and backing store. – Dietmar Kühl Dec 02 '15 at 09:21
  • 1
    There's actually a significant dissimilarity. Deleted files are _not_ flushed to disk. The reason for the dissimilarity is that the relevant bookkeeping to discard such a page is similar to that needed to discard a dirty L1 line, but the savings of not writing a page to disk are several orders of magnitude bigger. – MSalters Dec 02 '15 at 09:43
1

Quite a few allocators store the "free block list" in the free blocks themselves. I.e., when you call the deallocation function, the freed block is spliced into the free list, which could mean overwriting the old data with forward and backward pointers. These writes would overwrite at least the first part of the allocation.
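
A sketch of that splicing, assuming a doubly-linked free list kept inside the freed blocks themselves (the layout and `toy_free` are illustrative, not any real allocator's):

```c
#include <stddef.h>

/* Header written into the freed block itself, clobbering the user data. */
struct free_block {
    struct free_block *next;
    struct free_block *prev;
    size_t             size;
};

static struct free_block *free_list;

static void toy_free(void *p, size_t size)
{
    struct free_block *b = p;   /* reuse the freed storage for bookkeeping */
    b->size = size;
    b->prev = NULL;
    b->next = free_list;        /* splice onto the head of the list,      */
    if (free_list)              /* dirtying the block's first cacheline   */
        free_list->prev = b;
    free_list = b;
}
```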

A second technique used by allocators is to aggressively recycle memory. If the next allocation can be matched with the latest deallocation, chances are that the cache wasn't flushed to main memory.

The problem with your idea is that each individual write isn't actually that expensive, and figuring out what can be discarded would involve quite a bit of expensive bookkeeping. You can't do a syscall, realistically. That means you need to do the bookkeeping in each application (which is reasonable: deallocation of these small blocks usually returns the memory to the application, not the OS). That in turn means the application needs to know about the CPU cache design, which is by no means constant. The application would even need to be aware of the different cache coherence schemes!
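
Even the first piece of that bookkeeping - learning the cacheline size - is already platform-specific. A sketch using a glibc extension (`_SC_LEVEL1_DCACHE_LINESIZE` is not portable POSIX):

```c
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    long line = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);  /* glibc extension */
    if (line <= 0)
        puts("L1 dcache line size unknown on this platform");
    else
        printf("L1 dcache line: %ld bytes\n", line);
    return 0;
}
```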

MSalters
  • Given that the application interfaces with the memory via the memory management system whose default implementation, at least, ships with the compiler/standard library, I don't think application-level code would need to know about the cache layouts. Instead, "only" the memory management system would need to know (on that level it would also be known what part of the released memory continues to be used for the purposes of the memory management system). – Dietmar Kühl Dec 02 '15 at 09:23
  • @DietmarKühl: I'm using "application" here in the sense of "not kernel" - Ring 3 instead of ring 0, in x86 parlance. Practically speaking, you'd implement this not just via the standard library, but by having the OS map a code page full of the relevant instructions into each process. – MSalters Dec 02 '15 at 09:34
1

You're asking a number of related questions here. The one in boldface is the easiest to answer. When you release memory with anything like the generic kind of release, the only thing you're saying is "I don't need this any more". You are also saying, implicitly, "I don't care what you do with it". This "I don't care" is actually the answer to your question. You are not saying "you can discard this". You are saying "I don't care if you discard it or not".

To answer your question about CPU support, the MSI protocol is a basic cache-coherence protocol. The I state stands for "invalid", which is how you could implement the "memory not used" state you're asking about. In order to do this, you'd create a release interface with non-generic semantics; that is, this kind of release means "This memory is no longer used and you should avoid writing it back to main memory". These semantics, please note, place a requirement on the behavior of the CPU that the generic version does not. To implement this, you'd need to allocate memory in harmony with the CPU cache and then use the CPU instructions available to invalidate cache entries. You'd almost certainly need to write assembly code to make this work, to avoid unwarranted (and incorrect) assumptions about the memory model that the use of explicit cache management instructions would induce.
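
A hypothetical sketch of the two release semantics side by side (the names are illustrative, not from any real allocator; in usermode the discarding variant could only fall back to the generic one, per the other answers):

```c
#include <stdlib.h>

/* Generic release: "I don't need this any more, and I don't care
 * what you do with it." */
static void release(void *p)
{
    free(p);
}

/* Non-generic release: "this memory is dead; avoid writing its lines
 * back." A real implementation would require cacheline-aligned blocks
 * plus a privileged invalidate-without-writeback instruction, so in
 * usermode this can only degrade to the generic behavior. */
static void release_discard(void *p)
{
    free(p);
}
```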

I personally haven't needed to work at this level in some time, so I'm not familiar with what's available everywhere, that is, whether this technique could be made reasonably portable. Intel CPUs have the `INVLPG` instruction. The discussion here should be a decent launching pad for the next phase of your concerns: When to do or not do INVLPG, MOV to CR3 to minimize TLB flushing

eh9
  • `INVLPG` invalidates TLB entries; that's not the same thing as the data cache, far from it. Moreover, it's a privileged instruction, so you can't use it from userland code. The page you linked to talks about an activity done in kernel mode, not in user mode. As I pointed out in my answer below, you can't (and shouldn't be able to) write usermode invalidation-without-writeback code on any platform. – Iwillnotexist Idonotexist Dec 05 '15 at 06:11
  • @IwillnotexistIdonotexist As I said in my answer I'm not familiar with the nuts and bolts of how you might implement this; given your answer, you apparently are. I mentioned the instruction only to point in the direction where you'd look to start after you'd figured out what you want to implement. The original question, more than anything else, was about the proper semantics of these calls, which is what I focused on. – eh9 Dec 05 '15 at 14:34
  • The non-generic release indeed is a _should avoid writeback_. That means it's not an actual _shall_ requirement. And note that this is specifically in the context of a memory allocator, which generally comes as part of the C or C++ implementation. "Assembly" is not a downside there; compilers use that all the time. – MSalters Dec 10 '15 at 02:03