
I am profiling an application using sampling in VS 2012 (although the profiler doesn't matter that much). I have a good lead on where the performance bottleneck lies; however, I'm hampered by the fact that there are a lot of memory allocations going on, and the garbage collector seems to be significantly skewing my profiling (I can somewhat see the GC effects in CLR Profiler and Concurrency Visualizer).

Is there a way to somehow get rid of the samples acquired while the GC is running? I could use any of these:

  • Ignore samples collected while a GC is running (filter by function pointer?)
  • Separate time spent GCing and the time spent actually working
  • Increase GC limits to effectively "turn it off" for profiling
  • Actually turn off the GC (a rough sketch of the closest approximations available in code follows this list)
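
A minimal sketch of what the last two options can look like in code, assuming .NET 4.5 (the VS 2012 era): `GCSettings.LatencyMode` can defer collections as far as the runtime allows (it cannot turn the GC off), and bracketing the hot path with `GC.CollectionCount` at least tells you how many collections that path triggers. The `GcProbe` class and `Measure` name are made up for illustration.

```csharp
using System;
using System.Diagnostics;
using System.Runtime;

static class GcProbe
{
    // Runs `work` while counting GC collections and wall-clock time.
    // SustainedLowLatency (.NET 4.5) asks the runtime to avoid blocking
    // gen-2 collections where it can; it does not disable the GC.
    public static void Measure(Action work)
    {
        GCLatencyMode oldMode = GCSettings.LatencyMode;
        GCSettings.LatencyMode = GCLatencyMode.SustainedLowLatency;
        try
        {
            int gen0 = GC.CollectionCount(0);
            int gen2 = GC.CollectionCount(2);
            var sw = Stopwatch.StartNew();

            work();

            sw.Stop();
            Console.WriteLine("Elapsed: {0} ms, gen0 GCs: {1}, gen2 GCs: {2}",
                sw.ElapsedMilliseconds,
                GC.CollectionCount(0) - gen0,
                GC.CollectionCount(2) - gen2);
        }
        finally
        {
            GCSettings.LatencyMode = oldMode;
        }
    }
}
```

If the collection counts for the suspect code path are large, a fair share of the samples landing in the GC are ultimately chargeable to that path's allocations, even though the profiler attributes them elsewhere.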

The problem is that I pretty much don't know what I need to optimize. Attempts at optimizing the GC by reducing allocations etc. had very little real impact on release builds without a debugger attached, so I really want to know how much of the profiling results are due to disabled optimizations etc., and how much is code that could be improved (the code in question is used by a large portion of our projects, so even a 10% increase in performance would have a huge impact).

Luaan
  • You cannot turn it off and you cannot wish it away. Sounds to me you found the *real* bottleneck you should be working on first. You generate too much garbage. Maybe you can re-use buffers. Maybe that's too impractical and it is already as good as it is going to get. – Hans Passant Feb 17 '14 at 12:46
  • @HansPassant The problem is I don't know whether the code really is just that CPU intensive or not, because the particular functions that get the most samples are also doing a lot of allocations (and the rest of the application hardly does any, thanks to previous optimizations), thus leading to a GC in that code, even if it isn't necessarily their cause. I'm having trouble isolating the "real" work from the "stolen" CPU by the GC - ie. the question is - which to optimize: allocations or computations? How can I use the profiling data to separate the two? – Luaan Feb 17 '14 at 12:53
  • Well, just don't assume that GC is not real work. It is, allocating memory is not free and your code is going to be slowed down by it on the user's machine as well. So the problem is not that your code is too cpu intensive, the problem is that it uses too much memory. This is entirely normal, trust what the profiler tells you. And focus on improving it by making it use memory more effectively. – Hans Passant Feb 17 '14 at 13:02
  • @HansPassant Well, that's exactly my problem - I don't know if it's the GC that's giving me trouble. My code is definitely CPU intensive, and it's definitely putting significant memory pressure. But I have no idea how to use the profiling data to pinpoint which of the two is more important. And in practice, I've found out that while my memory optimizations did wonders in CLR profiler (allocated bytes dropping from 2.5 GiBs to ~100 MiBs), in the release version, they had incredibly low effect on performance. It seems likely that optimizing memory usage further will also yield very little. – Luaan Feb 17 '14 at 13:07
  • I would say that it is more complicated than that. The GC can do a lot of work without affecting the performance of your application. The biggest impact of the GC comes from the fact that it sometimes has to suspend managed threads to safely move objects around on the heap. The CLR outputs ETW events when this happens. Correlating those events with your own events can give you a good idea of how GC affects your application's performance. – Brian Rasmussen Feb 17 '14 at 20:26
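
As a concrete starting point for the ETW approach Brian Rasmussen describes, here is a rough sketch using the TraceEvent library (the Microsoft.Diagnostics.Tracing.TraceEvent NuGet package). The session name is arbitrary and the exact event names are assumptions about that library's parser, so treat it as an illustration rather than a finished tool: it listens for the CLR's suspend/restart events, whose difference is the pause your managed threads actually experience.

```csharp
using System;
using Microsoft.Diagnostics.Tracing;
using Microsoft.Diagnostics.Tracing.Parsers;
using Microsoft.Diagnostics.Tracing.Session;

class GcPauseListener
{
    static void Main()
    {
        // Real-time ETW sessions require admin rights; the session name is arbitrary.
        using (var session = new TraceEventSession("GcPauseSession"))
        {
            session.EnableProvider(ClrTraceEventParser.ProviderGuid,
                TraceEventLevel.Informational,
                (ulong)ClrTraceEventParser.Keywords.GC);

            var clr = new ClrTraceEventParser(session.Source);
            double suspendStartMSec = 0;

            // The runtime is about to suspend managed threads for a GC.
            clr.GCSuspendEEStart += data => suspendStartMSec = data.TimeStampRelativeMSec;

            // Managed threads are running again; log how long they were paused.
            clr.GCRestartEEStop += data =>
                Console.WriteLine("PID {0}: GC pause {1:f2} ms",
                    data.ProcessID, data.TimeStampRelativeMSec - suspendStartMSec);

            session.Source.Process();   // blocks and pumps events until the session is stopped
        }
    }
}
```

Emitting your own ETW events (for example via `System.Diagnostics.Tracing.EventSource`) around the suspect code and viewing both streams together is one way to do the correlation the comment suggests.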

1 Answer


I would suggest you back off and try a different approach. Here's how it works:

There is a speed bug in your program. (It is very unlikely there is not.) If you find it and fix it, you will save some fraction of time. Suppose it is 50%.

That means if you just run it under the IDE, and pause it manually while you're waiting for it, there is a 50% chance you will have stopped it in the time you would save. Find out what it's doing and why it's doing it, by looking at the call stack, each line of code on the call stack, and maybe the data.

Do this a small number of times, like 5, 10, or 20, depending on what you see. You will see it performing the speed bug on about 50% of those samples, guaranteed.

This will tell you some things that the profiler will not, such as:

  • If the speed bug is that you are performing lots of `new`s that you could possibly avoid by re-using objects, it will show you the exact line(s) where that is happening, and the reason why (a small reuse sketch follows this list). The sampling profiler can give you line-level inclusive time, but it cannot tell you the reason for the time being spent, and without knowing the reason you can't be sure you don't need it. OTOH, if the sample lands in GC, ignore it and look for `new`, because `new` is expensive too and it is what causes GC.

  • If the speed bug is that you are actually doing some file I/O or network access or sleeping deep inside some library routine you didn't know about, it will tell you that and why, and you can figure out a way around it. The sampling profiler will not tell you this, because it is a "CPU profiler", meaning it stops sampling whenever your program is blocked. If you switch to the instrumented profiler, it will not give you line-level precision. Either way, it will not tell you the reason why the time is being spent.
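
To make the first bullet concrete, here is a minimal object-reuse sketch under an invented scenario: a hot routine that currently allocates a fresh scratch buffer on every call. The names (`BufferReuseDemo`, `ProcessAllocating`, `ProcessReusing`) are made up for illustration; the point is only that recycling a buffer removes both the `new` and the GC work it eventually causes.

```csharp
using System;

static class BufferReuseDemo
{
    // Before: allocates a new scratch buffer on every call, feeding the GC.
    static byte[] ProcessAllocating(byte[] input)
    {
        var scratch = new byte[64 * 1024];   // fresh allocation per call
        Array.Copy(input, scratch, Math.Min(input.Length, scratch.Length));
        return scratch;
    }

    // After: one buffer per thread, reused across calls.
    [ThreadStatic] static byte[] _scratch;

    static byte[] ProcessReusing(byte[] input)
    {
        var scratch = _scratch ?? (_scratch = new byte[64 * 1024]);
        Array.Copy(input, scratch, Math.Min(input.Length, scratch.Length));
        return scratch;   // caller must not hold onto it across calls
    }
}
```

Whether a change like this is worth making is exactly what the pause samples tell you: if several stack samples land on that `new` (or in the GC it provokes), the reuse pays off; if none do, it is not the speed bug.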

You may have to endure some derision if you try this, but it will get you the results you want. What's more, if you find and fix that 50% speed bug, the program will be 2x faster. That has the effect of making further speed bugs easier to find. For example, if there was initially a 25% speed bug in addition to the 50% one, it is now a 50% one, and if you find and fix it you will be 4x faster. It may surprise you how far you can keep going this way; when you can't any more, you will be close to optimal.

Mike Dunlavey
  • I like your answer a lot. I know it's not a wait, since the CPU *is* actually running at full throttle, but those are all great suggestions. I'm going to try the blind pausing and see if it comes up with something interesting. So far it looks like zip decompression is the troublemaker, this could help identify the precise cause (and if I can do something about it). – Luaan Feb 17 '14 at 20:55
  • @Luaan: Actually I prefer the term "opportunity" to "bug", but it's a lot more syllables. It conveys the notion that the program may be fine, but it can still be sped up. OTOH, the way you find it is - as if it were a bug. – Mike Dunlavey Feb 17 '14 at 21:38
  • Okay, even though a huge number of things change in the debugger, I was still able to find ways to remove almost 70% of the runtime. As I suspected, it was in fact in the ZIP library (pure C# but awfully slow). Working around some unnecessary out-of-bounds checks (yes, a value of `anything & 0xFF` will *not* be out of bounds of a 256-item array) and a few other things (like going through a byte array byte-by-byte instead of using `Array.Copy` / `Buffer.BlockCopy`) helped immensely; a short sketch of these two tweaks follows the comments. The GC seems to be only a significant issue when running multi-threaded, but I don't need that in production. – Luaan Feb 18 '14 at 12:25
  • @Luaan: That's great! Isn't knowing better than guessing? Good luck. – Mike Dunlavey Feb 18 '14 at 15:17
  • Oh, you know, the delicate balance between `work` and `fun` :D – Luaan Feb 18 '14 at 15:36
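
For readers landing here later, a small sketch of the two tweaks mentioned in the comments above; the table contents, names, and sizes are invented for illustration. The masked index `value & 0xFF` is always in 0..255, so an explicit range check before indexing a 256-entry table is redundant, and a single `Buffer.BlockCopy` call replaces a byte-by-byte copy loop.

```csharp
using System;

static class DecompressHelpers
{
    // A 256-entry lookup table (here: bit-reversal, purely as an example).
    static readonly byte[] ReverseBits = BuildReverseTable();

    static byte[] BuildReverseTable()
    {
        var table = new byte[256];
        for (int i = 0; i < 256; i++)
        {
            int b = i, r = 0;
            for (int bit = 0; bit < 8; bit++) { r = (r << 1) | (b & 1); b >>= 1; }
            table[i] = (byte)r;
        }
        return table;
    }

    // `value & 0xFF` can only be 0..255, so no explicit range check is needed
    // before indexing the 256-entry table.
    public static byte Reverse(int value)
    {
        return ReverseBits[value & 0xFF];
    }

    // One bulk copy instead of a manual byte-by-byte loop.
    public static void CopyBlock(byte[] src, int srcOffset, byte[] dst, int dstOffset, int count)
    {
        Buffer.BlockCopy(src, srcOffset, dst, dstOffset, count);
    }
}
```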