2

At least two dozen of my employer's customers are reporting that our .NET 3.5 application will hang/freeze/stop responding, usually for 10 to 90 seconds, and then become responsive again.

We have been unable to reproduce the issue. We were eventually able to get one of the customers to send a few memory dumps that were created via Task Manager while the application was frozen.

I was surprised to find out that the .NET 3.5 Garbage Collector appears to be suspending the application. Here is some of the WinDbg output (!load wow64exts and !sw are there because it is a 32-bit application running on 64-bit Windows, and the memory dump was created with the 64-bit version of Task Manager, and the only way I could get WinDbg SOS commands to work was with wow64exts.

What could be the cause?

What options do I have to solve or work around it?

The dump file is here (beware - it is 1.5GB): https://1drv.ms/u/s!AolCnuRwlhMVhBi_F2UVBTYkoVtY

Output of k and !clrstack on main GUI thread:

0:000> !load wow64exts
0:000> !sw
Switched to Guest (WoW) mode
0:000:x86> .loadby sos mscorwks
0:000:x86> k
 # ChildEBP RetAddr  
00 01cfe97c 760b45c9 ntdll_77100000!NtWaitForSingleObject+0xc
01 01cfe9f0 710858fd KERNELBASE!WaitForSingleObjectEx+0x99
02 01cfea34 71085849 mscorwks!CLREventWaitHelper+0x2f
03 01cfea84 71085868 mscorwks!CLREvent::WaitEx+0x117
04 01cfea98 7114eaa5 mscorwks!CLREvent::Wait+0x17
05 01cfeaac 711b0f70 mscorwks!WKS::GCHeap::WaitUntilGCComplete+0x34
06 01cfeae8 710fb420 mscorwks!Thread::RareDisablePreemptiveGC+0x1b4
07 01cfeb84 6f16a19f mscorwks!JIT_RareDisableHelper+0x16
08 01cfec18 6f169c77 System_Windows_Forms_ni+0x20a19f
09 01cfec70 6f169ac1 System_Windows_Forms_ni+0x209c77
0a 01cfeca0 6f67fd12 System_Windows_Forms_ni+0x209ac1
0b 01cfecb4 0b8e9c91 System_Windows_Forms_ni+0x71fd12
WARNING: Frame IP not in any known module. Following frames may be wrong.
0c 01cfece4 03c70f42 0xb8e9c91
0d 01cfed20 71081b6c 0x3c70f42
0e 01cfed30 710985ab mscorwks!CallDescrWorker+0x33
0f 01cfedb0 710a064b mscorwks!CallDescrWorkerWithHandler+0xa3
10 01cfeef4 710a067e mscorwks!MethodDesc::CallDescr+0x19c
11 01cfef10 710a069c mscorwks!MethodDesc::CallTargetWorker+0x1f
12 01cfef28 7110579c mscorwks!MethodDescCallSite::Call+0x1a
13 01cff08c 711607bd mscorwks!ClassLoader::RunMain+0x20d
14 01cff2f4 71160cda mscorwks!Assembly::ExecuteMainMethod+0xa6
15 01cff7c4 71160ec4 mscorwks!SystemDomain::ExecuteMainMethod+0x45e
16 01cff814 71160df4 mscorwks!ExecuteEXE+0x59
17 01cff85c 7163d93b mscorwks!_CorExeMain+0x15c
18 01cff89c 7177e8b9 mscoreei!_CorExeMain+0x10e
19 01cff8b0 71784e18 mscoree!ShellShim__CorExeMain+0xa9
1a 01cff8b8 75d68484 mscoree!_CorExeMain_Exported+0x8
1b 01cff8cc 7716302c kernel32!BaseThreadInitThunk+0x24
1c 01cff914 77162ffa ntdll_77100000!__RtlUserThreadStart+0x2f
1d 01cff924 00000000 ntdll_77100000!_RtlUserThreadStart+0x1b
0:000:x86> !clrstack
OS Thread Id: 0x26ec (0)
ESP       EIP     
01cfeb90 0000002b [InlinedCallFrame: 01cfeb90] System.Windows.Forms.UnsafeNativeMethods.WaitMessage()
01cfeb8c 6f16a188 System.Windows.Forms.Application+ComponentManager.System.Windows.Forms.UnsafeNativeMethods.IMsoComponentManager.FPushMessageLoop(Int32, Int32, Int32)
01cfec28 6f169c77 System.Windows.Forms.Application+ThreadContext.RunMessageLoopInner(Int32, System.Windows.Forms.ApplicationContext)
01cfec7c 6f169ac1 System.Windows.Forms.Application+ThreadContext.RunMessageLoop(Int32, System.Windows.Forms.ApplicationContext)
01cfecac 6f67fd12 System.Windows.Forms.Application.Run()
01cfecbc 0b8e9c91 NavSim.Application.StartupManager.Run()
01cfecec 03c70f42 NavSim.Application.EntryPoint.Main(System.String[])
01cfef48 71081b6c [GCFrame: 01cfef48] 

Output of !threads:

0:000:x86> !threads
ThreadCount: 30
UnstartedThread: 0
BackgroundThread: 25
PendingThread: 0
DeadThread: 1
Hosted Runtime: no
                                      PreEmptive   GC Alloc           Lock
       ID OSID ThreadOBJ    State     GC       Context       Domain   Count APT Exception
   0    1 26ec 01ef7458      6020 Enabled  00000000:00000000 01ef2e28     0 STA
   2    2  f70 01f06878      b220 Enabled  00000000:00000000 01ef2e28     0 MTA (Finalizer)
XXXX    3    0 07a02478   1801820 Enabled  00000000:00000000 01ef2e28     0 Ukn (Threadpool Worker)
   5    4 2970 0b921530      b220 Enabled  00000000:00000000 01ef2e28     0 MTA
   8    5 2824 07a0c1b8   200b220 Enabled  00000000:00000000 01ef2e28     0 MTA
   4    9 1dac 0ba71370       220 Enabled  00000000:00000000 01ef2e28     0 Ukn
  11    a 2498 0ba04f50      b220 Enabled  00000000:00000000 01ef2e28     0 MTA
  12    b 1ab4 0bab8f78      b220 Enabled  00000000:00000000 01ef2e28     0 MTA
  14    d 2b44 0bab8b90      b220 Enabled  00000000:00000000 01ef2e28     0 MTA
  17   10 128c 0bab9b30   200b220 Enabled  00000000:00000000 01ef2e28     0 MTA
  18   11  35c 0bab7fd8      b220 Enabled  00000000:00000000 01ef2e28     5 MTA
  19   12 2fd4 27c47580      b220 Enabled  00000000:00000000 01ef2e28     0 MTA
  15   14 3308 27c48520       220 Enabled  00000000:00000000 01ef2e28     0 STA
  22    e  b44 27c48cf0   200b020 Enabled  00000000:00000000 01ef2e28     0 MTA
  23    c 1694 0bab83c0   200b220 Enabled  00000000:00000000 01ef2e28     0 MTA
  24   1a 19e8 281657c0    80a220 Enabled  00000000:00000000 01ef2e28     0 MTA (Threadpool Completion Port)
  25   16  24c 28165ba8   180b220 Enabled  00000000:00000000 01ef2e28     0 MTA (Threadpool Worker)
  26   1e  e5c 0bab9748   180b220 Enabled  00000000:00000000 01ef2e28     0 MTA (Threadpool Worker)
  27   21 3258 28152a38   180b220 Enabled  00000000:00000000 01ef2e28     0 MTA (Threadpool Worker)
  28    f 1248 28151e80   180b220 Enabled  00000000:00000000 01ef2e28     0 MTA (Threadpool Worker)
  29   20  85c 28151a98   880b220 Enabled  00000000:00000000 01ef2e28     0 MTA (Threadpool Completion Port)
  30   1f 22c0 28152268   200b220 Enabled  00000000:00000000 01ef2e28     0 MTA
  31   1b 21e0 28154978      b020 Enabled  6816529c:68166080 01ef2e28     0 MTA
  32   19 229c 28152650   180b220 Enabled  00000000:00000000 01ef2e28     0 MTA (Threadpool Worker)
  36   17 2270 28152e20   200b220 Enabled  00000000:00000000 01ef2e28     0 MTA
  37    8 2348 28150ee0      b020 Disabled 681660a4:681660a4 01ef2e28     1 MTA (GC)
  38   13 1234 28153208   180b220 Enabled  00000000:00000000 01ef2e28     0 MTA (Threadpool Worker)
  39   18 16b0 281535f0   180b220 Enabled  00000000:00000000 01ef2e28     0 MTA (Threadpool Worker)
  40    7 2f6c 281539d8   180b222 Disabled 00000000:00000000 01ef2e28     0 MTA (Threadpool Worker)
  41    6  3f4 281512c8   180b220 Enabled  00000000:00000000 01ef2e28     0 MTA (Threadpool Worker)

Output of !threadpool:

0:000:x86> !threadpool
CPU utilization 81%
Worker Thread: Total: 9 Running: 9 Idle: 0 MaxLimit: 1000 MinLimit: 4
Work Request in Queue: 940
AsyncTimerCallbackCompletion TimerInfo@27ff74e0
AsyncTimerCallbackCompletion TimerInfo@2823e5f8
AsyncTimerCallbackCompletion TimerInfo@27ea3260
AsyncTimerCallbackCompletion TimerInfo@41444968
AsyncTimerCallbackCompletion TimerInfo@2823eb08
AsyncTimerCallbackCompletion TimerInfo@41444f08
AsyncTimerCallbackCompletion TimerInfo@27ffb938
AsyncTimerCallbackCompletion TimerInfo@27ffbb30
AsyncTimerCallbackCompletion TimerInfo@27ffb620
AsyncTimerCallbackCompletion TimerInfo@27ff7888
AsyncTimerCallbackCompletion TimerInfo@27ea4028
AsyncTimerCallbackCompletion TimerInfo@2823d6c8
AsyncTimerCallbackCompletion TimerInfo@27ea32f0
AsyncTimerCallbackCompletion TimerInfo@2823e4d8
AsyncTimerCallbackCompletion TimerInfo@07a93d18
AsyncTimerCallbackCompletion TimerInfo@2823ed48
AsyncTimerCallbackCompletion TimerInfo@4102d320
AsyncTimerCallbackCompletion TimerInfo@4102d170
[NOTE: 922 SIMILAR LINES AS ABOVE SNIPPED FOR 30000 QUESTION CHARACTER LIMIT]
--------------------------------------
Number of Timers: 73
--------------------------------------
Completion Port Thread:Total: 1 Free: 1 MaxFree: 8 CurrentLimit: 0 MaxLimit: 1000 MinLimit: 4

Native call stack (k) of thread 37, the thread which is marked (GC):

0:000:x86> ~37s 
ntdll_77100000!NtWaitForSingleObject+0xc:
7716a25c c20c00          ret     0Ch
0:037:x86> k
 # ChildEBP RetAddr  
00 0bbfed80 760b45c9 ntdll_77100000!NtWaitForSingleObject+0xc
01 0bbfedf4 710858fd KERNELBASE!WaitForSingleObjectEx+0x99
02 0bbfee38 71085849 mscorwks!CLREventWaitHelper+0x2f
03 0bbfee88 71085868 mscorwks!CLREvent::WaitEx+0x117
04 0bbfee9c 711b9651 mscorwks!CLREvent::Wait+0x17
05 0bbfeef4 7114c08d mscorwks!Thread::SysSuspendForGC+0x697
06 0bbfefe4 7114a3ad mscorwks!WKS::GCHeap::SuspendEE+0x194
07 0bbff00c 7114c2fc mscorwks!WKS::GCHeap::GarbageCollectGeneration+0x136
08 0bbff098 71149c43 mscorwks!WKS::gc_heap::try_allocate_more_space+0x15a
09 0bbff0ac 71149cde mscorwks!WKS::gc_heap::allocate_more_space+0x11
0a 0bbff0cc 71082341 mscorwks!WKS::GCHeap::Alloc+0x3b
0b 0bbff0e8 71089c91 mscorwks!Alloc+0x60
0c 0bbff128 71089d56 mscorwks!FastAllocateObject+0x38
0d 0bbff1cc 29fd0b51 mscorwks!JIT_NewFast+0x9e
WARNING: Frame IP not in any known module. Following frames may be wrong.
0e 0bbff1f4 29fd0acf 0x29fd0b51
0f 0bbff214 2c1107fe 0x29fd0acf
10 0bbff22c 24454a20 0x2c1107fe
11 0bbff258 2445497d 0x24454a20
12 0bbff268 2445491f 0x2445497d
13 0bbff27c 2445456d 0x2445491f
14 0bbff2dc 2445409a 0x2445456d
15 0bbff388 24453d8a 0x2445409a
16 0bbff3a8 2c1168b3 0x24453d8a
17 0bbff3b8 2c1165a5 0x2c1168b3
18 0bbff3d8 338cb7a2 0x2c1165a5
19 0bbff3f0 338cb6a9 0x338cb7a2
1a 0bbff408 338cb666 0x338cb6a9
1b 0bbff414 338cb640 0x338cb666
1c 0bbff434 022290e1 0x338cb640
1d 0bbff45c 02227a0b 0x22290e1
1e 0bbff4cc 022267ab 0x2227a0b
1f 0bbff4f8 0222641a 0x22267ab
20 0bbff514 02347746 0x222641a
21 0bbff578 0234736b 0x2347746
22 0bbff5b8 706f88b6 0x234736b
23 0bbff5c4 707022ff mscorlib_ni+0x2188b6
24 0bbff5d8 706f8834 mscorlib_ni+0x2222ff
25 0bbff5f0 71081b6c mscorlib_ni+0x218834
26 0bbff600 710985ab mscorwks!CallDescrWorker+0x33
27 0bbff680 710a064b mscorwks!CallDescrWorkerWithHandler+0xa3
28 0bbff7b8 710a067e mscorwks!MethodDesc::CallDescr+0x19c
29 0bbff7d4 710a069c mscorwks!MethodDesc::CallTargetWorker+0x1f
2a 0bbff7ec 711bd119 mscorwks!MethodDescCallSite::Call+0x1a
2b 0bbff9d4 710b965f mscorwks!ThreadNative::KickOffThread_Worker+0x192
2c 0bbff9e8 710b95fb mscorwks!ManagedThreadBase_DispatchInner+0x4f
2d 0bbffa7c 710b951c mscorwks!ManagedThreadBase_DispatchMiddle+0xb1
2e 0bbffab8 710b96ad mscorwks!ManagedThreadBase_DispatchOuter+0x6d
2f 0bbffae0 711bceea mscorwks!ManagedThreadBase_FullTransitionWithAD+0x25
30 0bbffaf8 711bcfc4 mscorwks!ManagedThreadBase::KickOff+0x13
31 0bbffb94 711a5cf9 mscorwks!ThreadNative::KickOffThread+0x26b
32 0bbffc2c 75d68484 mscorwks!Thread::intermediateThreadProc+0x49
33 0bbffc40 7716302c kernel32!BaseThreadInitThunk+0x24
34 0bbffc88 77162ffa ntdll_77100000!__RtlUserThreadStart+0x2f
35 0bbffc98 00000000 ntdll_77100000!_RtlUserThreadStart+0x1b

Any ideas would be most appreciated; Thanks!

Jason
  • 309
  • 1
  • 3
  • 16
  • I'm not going to try loading the dump file, but: do you have lots and lots of long-lived objects? perhaps large lists/arrays of them? there *are* mechanisms to hold large data in a way that doesn't impact GC, but... is that the issue? – Marc Gravell Nov 02 '18 at 22:57
  • Yes there are lots of short and long-live objects on the managed heap. Would that really cause Garbage Collection to take 60+ seconds? I am wondering if something is preventing GC from starting. E.g. notice Thread 40 says Preemptive GC Disabled. !FinalizeQueue is concerning too: it says generation 0 has 0 finalizable objects, generation 1 has 0 finalizable objects, generation 2 has 64897 finalizable objects, and there are ready for finalization 0 objects – Jason Nov 02 '18 at 23:10
  • Or a sequence of dumps should be captured (both within the hang period and after), so that you can compare the threads to see which threads must be the keys to unlock the culprit. – Lex Li Nov 03 '18 at 02:35
  • It turns out the user did send three dumps taken five minutes apart from each other. I believe one of them was taken after the hang was over. I will look into comparing those tomorrow and hopefully update the bottom of the question – Jason Nov 03 '18 at 03:03
  • Lots of pinned objects can cause issues, as can lots of inappropriately long lived ones. Have your inspected the live objects in the dump to see if there are lots of instances of any of your classes that you wouldn't expect to have long lifetimes? – Damien_The_Unbeliever Nov 05 '18 at 07:19

1 Answers1

2

At least two dozen of my employer's customers are reporting that our .NET 3.5 application will hang/freeze/stop responding, usually for 10 to 90 seconds, and then become responsive again.

I've seen that in cases of high CPU usage, where .NET obviously thought that CPU has higher precedence than garbage collection. Then, in the last moment possible, it performed the garbage collection, taking a long time.

I was surprised to find out that the .NET 3.5 Garbage Collector appears to be suspending the application.

There are three modes, "preemptive", "concurrent" and "background" garbage collection. The preemtive one can suspend the application. AFAIR, there were times when threads were really in suspended mode. This does not seem to be the case any more.

What could be the cause?

0:002:x86> !threadpool
CPU utilization 81%

FYI: 81% is a magic value indicating a garbage collection in progress, also preventing new threads being created in the thread pool.

You have identified thread 37 as the GC thread and that it's suspending the application.

One of the possible causes is a blocking finalizer. However, that does not seem to be the case here. The finalizer thread is still waiting for the finalizer event:

0:002:x86> k
 # ChildEBP RetAddr  
00 062bfa1c 760c5943 ntdll_77100000!NtWaitForMultipleObjects+0xc
01 062bfbb0 760c57f8 KERNELBASE!WaitForMultipleObjectsEx+0x133
02 062bfbcc 7114a136 KERNELBASE!WaitForMultipleObjects+0x18
03 062bfbec 7114e102 mscorwks!WKS::WaitForFinalizerEvent+0x77
04 062bfc00 710b965f mscorwks!WKS::GCHeap::FinalizerThreadWorker+0x49

However, it seems that some threads cannot be suspended (column PreEmptive GC):

                                   PreEmptive   GC Alloc               Lock
        ID OSID ThreadOBJ    State     GC       Context       Domain   Count APT Exception
   37    8 2348 28150ee0      b020 Disabled 681660a4:681660a4 01ef2e28     1 MTA (GC)
   40    7 2f6c 281539d8   180b222 Disabled 00000000:00000000 01ef2e28     0 MTA (Threadpool Worker)

That worker thread seems to be related to those many timer entries you already mentioned:

0:040:x86> k
[...]
0b 0e51f5b0 710b965f mscorwks!AddTimerCallback_Worker+0x66
[...]

Let's see:

AsyncTimerCallbackCompletion TimerInfo@4102d170
--------------------------------------
Number of Timers: 73
--------------------------------------
Completion Port Thread:Total: 1 Free: 1 MaxFree: 8 CurrentLimit: 0 MaxLimit: 1000 MinLimit: 4
0:040:x86> dp 4102d170 L5
4102d170  715e33d8 4102d320 0835739e 711381c9
4102d180  07a67be8

0:040:x86> dp 07a67be8 L2
07a67be8  00000001 01ddd1b0

0:040:x86> !do poi(01ddd1b0)
Name: System.Threading._TimerCallback
[...]

So it seems you're using a System.Threading.Timer.

I'm not 100% sure if I do this part right, but it suggests you're stopping the timer when the callback is invoked:

0:040:x86> k
 # ChildEBP RetAddr  
00 0e51f34c 706c9f4c mscorwks!CompareExchangeMP+0x8
01 0e51f370 702866d7 mscorlib_ni+0x1e9f4c
[...]
0:040:x86> !ip2md 702866d7 
MethodDesc: 6fd71474
Method Name: System.Timers.Timer.Stop()

Please make sure that you're using the Threading Timer correctly

Thomas Weller
  • 55,411
  • 20
  • 125
  • 222
  • A big thank you for looking at the dump file :D I am still reading the answer and the link about using Timer correctly, but I strongly suspect you have found the problem. We noticed `mTimer_Elapsed` in `!dumpstack` today, configured the application in such a way that it would run the code that starts and stops the `Timer` **often**, and were actually able to reproduce the issue for the first time. We _are_ stopping the timer from inside the callback. Is that illegal? I'll read the timer link now. I'll accept the answer when I confirm (the way we use) the `Timer` is the problem. – Jason Nov 06 '18 at 00:29
  • We are using `System.Timers.Timer`. We can improve it by setting `AutoReset` to `false`, and then will not need to call `Stop()` in the `Elapsed` handler. Also, I see in the docs for [Timer.Interval](https://learn.microsoft.com/en-us/dotnet/api/system.timers.timer.interval?view=netframework-4.7.2) that if we need to reset the timer (restart the countdown), we can just set `Interval` again instead of disposing and constructing a brand new timer. – Jason Nov 06 '18 at 02:19
  • Also, the timers are saved in an instance variable and carelessly accessed by multiple threads. Looks that there was zero consideration given to race conditions when interacting with the timer itself, nor to reentrancy of the callback method. I would not be surprised if multiple threads were calling `Stop()` on the same timer simultaneously. To make matters worse, the [docs say](https://learn.microsoft.com/en-us/dotnet/api/system.timers.timer?view=netframework-3.5) "Any instance members are not guaranteed to be thread safe." – Jason Nov 06 '18 at 02:23
  • I will make changes to (1) create less Timers (set `Interval` to reset countdown), (2) use `AutoReset` instead of calling `Stop()` in the callback, (3) give some careful thought to the race conditions, and use locks/critical sections if the `Timer` instance really needs to be accessed from multiple threads, and (4) take into account that the callback runs on thread pool and can therefore could technically be queued for execution and even run twice simultaneously. Cheers and thanks again - I'll follow up with how it goes. – Jason Nov 06 '18 at 02:32
  • @Jason: yes, somehow I didn't notice that the last step in the analysis is about a Timers.Timer, not a Threading.Timer. The Threading.Timer is sometimes used in a wrong way by setting the first interval to 0, so it fires immediately, which is not what people want. I'll need to look at it again – Thomas Weller Nov 06 '18 at 07:47
  • Looking at it further, the code around the timer can be improved to instantiate far less timers, and needs some sort of thread synchronization (even a `lock` or `Monitor`) to protect the timer from concurrent access by various threads. What happens if two threads call `Stop()` simultaneously? Possibility benign, possibly harmful, but nothing good, I'm sure :) – Jason Nov 06 '18 at 14:41