
I had a little misunderstanding about .NET and the CPU cache. I thought that only the thread stack was stored in the CPU cache, but apparently part of the heap, specifically Gen 0, is actually allocated in the CPU L2 cache.

I have read several things like: "The initial size limit of Gen 0 is determined by the size of the processor cache."

But what happens when Gen 0 is bigger than the processor cache? Is it then split between RAM and cache? How? Or is it moved entirely to RAM? I have read comments from people claiming they had a Gen 0 of around 500 MB, so it is very unlikely they had a 500 MB CPU cache.

As far as I know (and I may be wrong), objects in Gen 0 can be shared across threads, so how is it possible to share an object across threads scheduled on different CPUs if it is stored in a CPU cache? Does .NET take care of putting the object in RAM if it is not local?

vtortola
  • I would recommend reading the famous https://disruptor.googlecode.com/files/Disruptor-1.0.pdf or http://mechanical-sympathy.blogspot.cz/2011/07/memory-barriersfences.html. There are very nice pictures and understandable descriptions as well. Although it's written "for Java", caching at the CPU level is not language dependent :-) – Martin Podval Jul 02 '14 at 12:33

1 Answer


You have quite a large misunderstanding about how the CPU cache (and really, the CPU itself, and the whole abstraction layer above it) works. .NET can't force anything to be in any CPU cache; that's solely the responsibility of the CPU and no one else. The cache is always a duplicate of RAM: if something is in the cache (and still valid), it will also be in RAM. In any case, all of those things are implementation details, and you can't rely on them anyway.

All of your questions require quite broad answers. The simple answer is that multi-threaded programming is very hard, and if you don't think so, you don't have much experience yet :) Once you realize how many assumptions and performance optimizations CPUs make, you'll also realize that C++ isn't really all that much closer to the "real hardware" than C# is.

All memory is shared across threads by default - if you pass the reference. This is dangerous, because it gives rise to synchronization issues. Some are caused by caching (whether in the CPU cache or even in the CPU registers), and some are caused by the fact that most of the operations you perform are not atomic.
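To make the non-atomic point concrete, here is a minimal sketch (the `Counter` class and its parameters are hypothetical, just for illustration): `unsafeCount++` is really a read-modify-write of three steps, so two threads can read the same old value and one increment gets lost, while `Interlocked.Increment` performs the whole operation atomically.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

static class Counter
{
    // Runs several tasks that each increment both counters the same
    // number of times; returns (unsafeCount, safeCount).
    public static (int Unsafe, int Safe) Run(int tasks = 4, int perTask = 100_000)
    {
        int unsafeCount = 0, safeCount = 0;
        var work = new Task[tasks];
        for (int i = 0; i < tasks; i++)
        {
            work[i] = Task.Run(() =>
            {
                for (int j = 0; j < perTask; j++)
                {
                    unsafeCount++;                        // non-atomic: lost updates possible
                    Interlocked.Increment(ref safeCount); // atomic: always exact
                }
            });
        }
        Task.WaitAll(work);
        return (unsafeCount, safeCount);
    }
}
```

After the tasks finish, the safe counter always equals tasks × perTask; the unsafe one usually comes out lower.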

Now, of course, if you're doing some isolated, CPU-bound work, you can gain a lot of benefit from being able to fit the whole memory you're working with into the CPU cache. You can only help this along by using data structures that are small enough; you can't force a piece of data to be cached (in fact, every single thing you read from memory will be in the CPU cache at one point or another - the CPU can't read directly from RAM, which is way too slow). If you can fit your whole data inside the cache, and nothing causes you to be evicted from it (remember, it's a multi-tasking environment), you can get amazing performance even from conventionally expensive operations (e.g. lots of jumps around memory rather than sequential access).
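The access-pattern part can be sketched like this (the `Locality` class is hypothetical): summing the same row-major matrix row-by-row reads memory sequentially, so each cache line is fully used, while summing it column-by-column touches a different cache line on every read. Both loops compute the same total, but on matrices larger than the cache the strided pass is typically much slower.

```csharp
using System;

static class Locality
{
    // Sums an n x n matrix of ones stored row-major, first row-by-row
    // (sequential, cache friendly) and then column-by-column (strided,
    // a new cache line on nearly every read). The totals are identical.
    public static (long RowSum, long ColSum) SumBothWays(int n)
    {
        var m = new int[n * n];
        for (int i = 0; i < m.Length; i++) m[i] = 1;

        long rowSum = 0;
        for (int r = 0; r < n; r++)
            for (int c = 0; c < n; c++)
                rowSum += m[r * n + c]; // next element sits on the same cache line

        long colSum = 0;
        for (int c = 0; c < n; c++)
            for (int r = 0; r < n; r++)
                colSum += m[r * n + c]; // jumps n * sizeof(int) bytes per read

        return (rowSum, colSum);
    }
}
```

The result is the same either way; only the time differs, which is exactly the "fit your data in the cache" effect described above.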

As soon as you need to share data between threads, though, you start to get into trouble. You need synchronization to make sure the two CPUs (or CPU cores; I'm not going to distinguish between those) are actually working on the same data!
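The simplest form of that synchronization in .NET is a `lock`. A minimal sketch (the `SyncDemo` class is hypothetical): `List<T>` is not thread-safe, so every access goes through one lock, which both serializes the writers and makes each writer's changes visible to the other cores when the lock is released.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

static class SyncDemo
{
    // Several tasks append to one shared list. The lock serializes the
    // writers; without it, concurrent Add calls would corrupt the list
    // or drop items.
    public static List<int> FillShared(int tasks = 4, int perTask = 10_000)
    {
        var shared = new List<int>();
        var gate = new object();
        var work = new Task[tasks];
        for (int i = 0; i < tasks; i++)
        {
            work[i] = Task.Run(() =>
            {
                for (int j = 0; j < perTask; j++)
                {
                    lock (gate)       // one writer at a time
                        shared.Add(j);
                }
            });
        }
        Task.WaitAll(work);
        return shared;                // always tasks * perTask items
    }
}
```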

Now, in practice, you're going to find that CPU caches tend to be shared between the cores to an extent. That's good, because sharing through the CPU cache is still about an order of magnitude faster than synchronizing through RAM. However, you can still run into many issues, such as this pretty typical thread loop:

while (!aborted)
{
  ...
}

In theory, it is quite possible for this to simply become an infinite loop. An aggressive compiler might see that you never change the value of aborted and simply replace !aborted with true (.NET's JIT will not), or it might keep the value of aborted in a register.

Registers are never synchronized automatically. This can be quite a problem if the body of the thread loop is simple enough. As you dive deeper into multi-threaded programming, you'll be completely devastated by the code you used to write and the assumptions you used to hold.
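The standard fix for the loop above is to mark the flag `volatile` (or use `Volatile.Read`/`Volatile.Write` on individual accesses), which tells the JIT it must re-read the field from memory on every pass instead of caching it in a register. A minimal sketch, with a hypothetical `Worker` class:

```csharp
using System;
using System.Threading;

static class Worker
{
    // Without 'volatile', the JIT could keep 'aborted' in a register and
    // the loop might never observe the main thread's write. 'volatile'
    // forces every read of the flag to go back to memory.
    static volatile bool aborted;

    public static int RunUntilAborted()
    {
        aborted = false;
        int iterations = 0;
        var t = new Thread(() =>
        {
            while (!aborted)   // re-read from memory on every pass
                iterations++;
        });
        t.Start();
        Thread.Sleep(50);      // let the worker spin for a moment
        aborted = true;        // guaranteed to become visible to the worker
        t.Join();              // the loop really does terminate
        return iterations;
    }
}
```

The important part is that `t.Join()` returns at all: the write to the volatile flag is guaranteed to be observed, so the loop cannot be compiled into the infinite form described above.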

The most important thing to remember is that all those optimizations the compilers and CPUs make are only guaranteed not to change the behaviour of isolated, single-threaded code. When you break that assumption, all hell breaks loose.

Luaan
  • The .NET runtime can force things to cache, using the prefetch instructions. It just can't ensure it stays there. – user1937198 Jul 01 '14 at 15:53
  • Also, the instruction cache can be the responsibility of the runtime when it is doing codegen, as it is the JIT's responsibility to flush the generated code from the data cache. – user1937198 Jul 01 '14 at 15:55
  • @user1937198 It's hard to be sure about those things, since that's part of the (quite closed) implementation of the runtime itself. And given how likely those kinds of operations (loading the whole Gen 0 heap to cache) are going to be very expensive for very little benefit, I'm pretty sure .NET doesn't do that. And as for the JIT, I don't quite see how it would help performance to let .NET force that - the cache lines would be dropped rather fast, I think. But those are just (somewhat educated) guesses. – Luaan Jul 01 '14 at 16:02
  • The first was a counter to your absolute "can't do it". The second comment is a correctness problem: x86 processors don't guarantee that if you write something out as data and then start executing it, you'll execute the data you wrote, unless you explicitly flush the I$ and D$ caches. And that is exactly what the JIT does during codegen. – user1937198 Jul 01 '14 at 16:05
  • @user1937198 - Is this referring to something different? http://stackoverflow.com/a/10994728/27423 – Daniel Earwicker Jul 01 '14 at 16:19
  • @DanielEarwicker No, I hadn't seen that Intel's cache coherency was so advanced. Still important on ARM, which is a .NET platform as well. – user1937198 Jul 01 '14 at 16:23
  • @user1937198 - Indeed. Intel takes care of a lot of gnarly stuff automatically that the RISC chips leave to the compiler writer (or even to the user; e.g. in about 1998 or so I ported a Win32 app from x86 to DEC Alpha and got a few memory alignment issues). – Daniel Earwicker Jul 01 '14 at 16:32
  • It nicely illustrates how hard it is to guess right even when you know *a lot*. All those abstraction layers are much deeper and more complicated than they seem. Which gets us back to the main point - measure. It helps to know more, but in the end you're about as likely to make a bad decision based on an incomplete understanding. – Luaan Jul 02 '14 at 07:21