
Our multithreaded application runs a lengthy computational loop. On average it takes about 29 sec to finish one full cycle. During that time, the .NET performance counter "% Time in GC" reads 8.5%, all of it caused by Gen 2 collections.

In order to improve performance, we implemented a pool for our large objects and achieved a 100% reuse rate. The overall cycle now takes only 20 sec on average, "% Time in GC" shows something between 0.3 and 0.5%, and the GC performs only Gen 0 collections.
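For illustration, a minimal sketch of what such a pool could look like (in Java rather than .NET, and with hypothetical class and method names; this is not our actual implementation):

```java
import java.util.Arrays;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical pool for large double[] buffers: acquire() hands out a
// recycled array when one is available, otherwise allocates a fresh one;
// release() returns the buffer for reuse instead of leaving it to the GC.
final class ArrayPool {
    private final BlockingQueue<double[]> free;
    private final int bufferLength;

    ArrayPool(int capacity, int bufferLength) {
        this.free = new ArrayBlockingQueue<>(capacity);
        this.bufferLength = bufferLength;
    }

    double[] acquire() {
        double[] buf = free.poll();
        if (buf == null) {
            return new double[bufferLength]; // pool miss: allocate a new buffer
        }
        Arrays.fill(buf, 0.0);               // reset the recycled buffer before handing it out
        return buf;
    }

    void release(double[] buf) {
        free.offer(buf);                     // silently dropped if the pool is already full
    }
}
```

With a bounded queue the pool can never grow without limit, and the only cost on the hot path is the zeroing mentioned below.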

Let's assume the pooling is efficiently implemented and neglect the additional time it takes to execute. Then we got a performance improvement of roughly 31 percent ((29 − 20) / 29). How does that relate to the former GC value of 8.5%?

I have some assumptions, which I hope can be confirmed, adjusted and amended:

1) The "% Time in GC" counter (if I read it right) measures the ratio of two time spans:

  • the time between two GC cycles, and
  • the time used for the last full GC cycle, which is included in the first span.

What is not included in the second time span would be the overhead of stopping and restarting the worker threads for the blocking GC. But how could that be as large as 20% of the overall execution time?
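Putting the question's own numbers into a back-of-envelope calculation (a sketch of the arithmetic only; the attribution of the gap is exactly what is being asked here, not a measurement):

```java
// Arithmetic behind the "20%" puzzle: the counter only accounts for
// ~2.5 s of the 9 s that pooling actually saved per cycle.
class GcShare {
    public static void main(String[] args) {
        double cycleBefore = 29.0;                    // seconds per cycle, unpooled
        double cycleAfter  = 20.0;                    // seconds per cycle, pooled
        double measuredGc  = 0.085 * cycleBefore;     // ~2.5 s counted by "% Time in GC"
        double saved       = cycleBefore - cycleAfter;          // 9 s actually saved
        double unexplained = (saved - measuredGc) / cycleBefore; // ~0.22: the gap in question
        System.out.printf("measured GC %.2f s, saved %.1f s, unexplained %.0f%%%n",
                          measuredGc, saved, unexplained * 100);
    }
}
```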

2) Frequently blocking the threads for GC may introduce contention between the threads? It is just a thought; I could not confirm it via the VS concurrency profiler.

3) In contrast to that, it could be confirmed that the number of page faults (performance counter: Memory -> Page Faults/sec) is significantly higher for the unpooled application (25,000 per second) than for the application with the low GC rate (200 per second). I could imagine this contributing to the large improvement as well. But what could explain that behaviour? Is it because frequent allocations cause a much larger area of the virtual address space to be touched, which is therefore harder to keep in physical memory? And how could that be measured to confirm it as the reason here?
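The allocation-churn-versus-reuse difference can be made visible even in a small experiment. The following sketch illustrates the principle on the JVM rather than in .NET (absolute counts depend on heap size and collector, so only the relative comparison is meaningful):

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.Arrays;

// Compare collector activity when a large array is reallocated on every
// iteration versus zeroed in place and reused (the pooled style).
class ChurnDemo {
    static long collections() {
        long n = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            n += Math.max(0, gc.getCollectionCount()); // getCollectionCount may be -1 if unsupported
        }
        return n;
    }

    public static void main(String[] args) {
        final int len = 1 << 20;          // 1M doubles = ~8 MB, large-object territory
        double sink = 0;

        long before = collections();
        for (int i = 0; i < 500; i++) {   // churn: a fresh allocation each pass
            double[] a = new double[len];
            sink += a[i];
        }
        long churn = collections() - before;

        double[] reused = new double[len];
        before = collections();
        for (int i = 0; i < 500; i++) {   // pooled: one buffer, zeroed in place
            Arrays.fill(reused, 0.0);
            sink += reused[i];
        }
        long pooled = collections() - before;

        System.out.println("collections: churn=" + churn + " pooled=" + pooled + " sink=" + sink);
    }
}
```

The reuse loop touches the same physical pages on every pass, while the churn loop keeps walking forward through fresh virtual address space, which is consistent with the page-fault difference observed above.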

BTW: GCSettings.IsServerGC = false, .NET 4.0, 64-bit, running on Win7, 4 GB RAM, Intel i5. (And sorry for the large question.. ;)

user492238
    I am hesitating to answer your question because a lot of info is missing to give you a real answer. Anyhow - the overall performance improvement *Could* be due to: 1. Reduced GC time (which was counted as ~8.0% =~ 2 seconds) 2. All the time you saved not having to recreate those really large objects because they are now efficiently pooled. This can be very costly, and probably is if you are spending 8.5% of the time on GC. – NightDweller Apr 10 '11 at 16:51
  • @NightDweller Thanks for your comment. Those objects are 'only' large arrays. The only time saved by pooling them is the time needed to clear them initially (set their elements to 0). The pooling, on the other hand, I would expect to be much more costly than any managed heap allocation would be. – user492238 Apr 10 '11 at 17:04
  • 1
    Allocation can be expensive under certain conditions - if you reallocate very often, and if you require long continuous blocks of memory which may become rare due to fragmentation.. But i must admit it doesn't seem all that likely. Did you try running a profiler on the code? (before/after) that should give you a very clear answer on what improved... – NightDweller Apr 10 '11 at 21:40
  • @NightDweller You are right. In our case, without pooling, every 'new' would potentially cause the allocation of a new LOH chunk from the memory manager. Regarding profiling: yes, I do all the time. But it's hard to tell from those logs. They look very similar and, due to the multithreading, are hardly comparable. Thanks again. – user492238 Apr 11 '11 at 03:16

2 Answers


Then we got a performance improvement of roughly 31 percent ((29 − 20) / 29). How does that relate to the former GC value of 8.5%?

By pooling, you're also saving the time spent in new, which can be considerable, but I wouldn't spend time trying to balance the numbers.

Rather than "look a gift horse in the mouth", why not move on to finding other "bottlenecks"?

When you remove one performance problem, you make others take a larger percentage of the time, because the denominator is smaller. So they are easier to find, provided you know how to look for them.

Here's an example, and a method. You clean out one big problem. That makes the next one bigger, by percent, so you clean that one out. Rinse and repeat. The program may eventually take so little time that you need to wrap a temporary outer loop around it, just to make it run long enough to investigate. You keep going this way, progressively making the program take less and less time, until you hit diminishing returns.

That's how to make the code fast.

Mike Dunlavey
  • Thanks, I appreciate your feedback. But doesn't that miss the question? I am not concerned with any *specific* application's performance, but rather seeking a deeper understanding of the fundamental reasons behind the improvement. By learning from that, I can possibly save a lot more time in the future by preventing those bottlenecks outright (which, until now, I didn't even imagine existed). – user492238 Apr 11 '11 at 03:57
  • @user492238: Well, as I said, when you use a pool to recycle used objects, you not only save the GC time, but the `new` time. I often see programs spending a large amount of time 1) allocating memory and running initializers, or 2) running destructors. That's what you save when pooling. *However*, no matter how good you and I are at avoiding slowness, *it happens*. So it's even more important to learn how to find it so you can remove it, after it happens, as well as before. – Mike Dunlavey Apr 11 '11 at 12:08

Pre-allocating the objects improves concurrency: the threads no longer have to enter the global lock that protects the garbage-collected heap in order to allocate an object. The lock is held for a very short time, but clearly you were allocating a lot of objects, so it isn't unlikely that threads were fighting for the lock.
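One way to take allocation out of the threads' shared path entirely is to give every worker its own buffer, so neither the runtime allocator nor a shared pool is ever contended. A sketch of that idea (in Java, with ThreadLocal standing in for a per-thread pool; the names are hypothetical):

```java
import java.util.Arrays;

// Each thread lazily gets its own private buffer on first use; later calls
// from the same thread always reuse that buffer, so the hot loop performs
// no shared-state synchronization and no heap allocation at all.
class PerThreadBuffer {
    private static final int LEN = 1 << 20;
    private static final ThreadLocal<double[]> BUF =
        ThreadLocal.withInitial(() -> new double[LEN]);

    static double compute() {
        double[] buf = BUF.get();   // same array on every call from this thread
        Arrays.fill(buf, 0.0);      // reset in place instead of allocating fresh
        buf[0] = 1.0;               // stand-in for the real work
        return buf[0];
    }
}
```

The trade-off is memory: every worker thread permanently holds one buffer, which is usually acceptable when the thread count is small and fixed.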

The "% Time in GC" performance counter measures the percentage of CPU time spent collecting instead of executing regular code. You can get a big number if there are a lot of gen 2 collections and the rate at which you allocate objects is so high that background collection can no longer keep up and the threads must be blocked. Having more threads makes that worse; they can allocate more.

Hans Passant