I want to know if anyone have ever encounter the same situation with me, that a NullReferenceException
occurred from a place which is impossible.
Here's the exact same code piece that throws the exception:
private readonly StatePair[] _pairs =
{
new StatePair(0),
new StatePair(1),
new StatePair(2)
};
public void CommitAll()
{
var states = new State[_pairs.Length];
// collect in reverse order
for (var i = _pairs.Length - 1; i >= 0; i--)
{
// The only exit of CaptureState() method is a "new State()" statement.
states[i] = _pairs[i].CaptureState();
if (states[i] == null)
throw new ApplicationException("The state captured is null.");
}
// commit in normal order
foreach (var s in states)
{
if (s == null)
throw new ApplicationException("The state is null.");
s.Commit(); // *** NullReferenceException ***
}
}
void Run() // Thread start
{
while (true)
{
CommitAll();
Thread.Sleep(200);
}
}
Stacktrace:
System.NullReferenceException: Object reference not set to an instance of an object.
at ...SystemUpdateCoordinator.CommitAll() in ...SystemUpdateCoordinator.cs:line 217
at ...SystemUpdateCoordinator.Run() in ...SystemUpdateCoordinator.cs:line 22
at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean ignoreSyncCtx)
at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state)
at System.Threading.ThreadHelper.ThreadStart()
From the stack trace we can tell that s.Commit()
throws NullReferenceException
so it seems that s
is null, but we've already check nulls and throw ApplicationException
after each CaptureState()
. If the CaptureState
returns now, we should get an ApplicationException
from the body of for
instead of the NullReferenceException
inside foreach
.
The scare thing is, not all the users get the exception (it's an WPF application). We have about 10 users running this new version and only four of them meet the error, but it happened to them multiple times in two days. We're in a large firm so all the user's environment are restrictedly controlled and should be the same (.NET 4.0, Win 7).
Also here're some more details that might help:
Is multi-threading involved?
Yes, but the only part might be access concurrently is the MyPair
object, and in CaptureState
method it holds the writer lock from a ReaderWriterLockSlim
. I've carefully reviewed the other methods in MyPair
and all of them are protected by the reader lock of the same ReaderWriterLockSlim
instance. Also, CaptureState()
returns a new instance each time. All the rest variables are local (like s
) and they won't have threading issue.
Any other related error happen at the same time?
Yes, actually there's another error occurred with the same new version of the application, which is an access violation error. It should be the same as this one, but I cannot make 100% sure that's the cause of the NullReferenceException
for two reasons:
- These two are not always happen together, although they happen together in most time.
- If the concurrent GC bug corrupts the memory which brings the
NullReferenceException
, why it always happen to this the same statement? Why it doesn't happen in other places?
This issue bothers me for several days. I'm considering it as a runtime issue, but that's not something that I can prove or prove it wrong to others. I'm also trying to change the CommitAll()
method a little bit to see if we can get different assembly code.
I'll post updates here if I have more info.
==== Update 1 (20150603) ====
Some people asked if the stark trace is accurate, and I think it is.
First of all there's no inlining happen to the CommitAll()
method. I've dumped the assembly code and it calls the Commit
method without inline.
Also, in the first place there's no throw new ApplicationException()
checks here. The code is as simple as below since it's impossible to have null here:
public void CommitAll()
{
var states = new State[_pairs.Length];
for (var i = _pairs.Length - 1; i >= 0; i--)
{
states[i] = _pairs[i].CaptureState();
}
foreach (var s in states)
{
s.Commit(); // <- Line 209
}
}
And the stack trace pointed to Line 209 which is the s.Commit()
. So I added some null checks, and the stack trace becomes Line 217 which is exactly the new position of the same statement.
==== Update 2 (20150609) ====
The crash only happened once in the past week, but strangely the WER (Windows Error Reporting) didn't generate a mini crash dump. The event log still said:
Application: MyApp.exe
Framework Version: v4.0.30319
Description: The process was terminated due to an internal error int the .NET Runtime at IP 000007FEEC0F25E8 (000007FEEC0B0000) with exit code 80131506.
It was different with the previous error logs said it was an access violation explicitly. The user said he doesn't see the WER dialog, so the application just crash suddenly.
So I can only check the old dumps from the old crashes.
One of the them says it was a "famous concurrent GC issue":
This dump file has an exception of interest stored in it.
The stored exception information can be accessed via .ecxr.
(2c90.381c): Access violation - code c0000005 (first/second chance not available)
clr!WKS::gc_heap::mark_object_simple+0x19a:
000007fe`eb13dc87 f70100000080 test dword ptr [rcx],80000000h ds:00000000`00000000=????????
0:008> kb
RetAddr : Args to Child : Call Site
000007fe`eb13dd8e : 00000000`b2103c48 00000000`2c92d850 00000000`2c92e9c0 00000000`b2023f08 : clr!WKS::gc_heap::mark_object_simple+0x19a
000007fe`eb1de7be : 00000000`b2103c48 00000000`2c92e9c0 00000000`2c92c2e0 00000000`2c92d7b0 : clr!WKS::GCHeap::Promote+0x79
000007fe`eb1e04ec : 00000000`2c92bec0 00000000`00000007 00000000`00000002 00000000`00000071 : clr!GcEnumObject+0x37
000007fe`eb1dfdd5 : 000007fe`00000000 00000000`00000000 00000000`00000000 000007fe`00000008 : clr!GcInfoDecoder::EnumerateLiveSlots+0x5dd
000007fe`eb1dea47 : 00000000`00080000 00000000`00000000 00000000`0230b540 00000000`2c92c2e0 : clr!EECodeManager::EnumGcRefs+0x13d
000007fe`eb09109f : 00000000`00000000 000007ff`02668548 00000000`2c92d7b0 00000000`00000000 : clr!GcStackCrawlCallBack+0x1e2
000007fe`eb090a6b : 00000000`2c92cb50 00000000`00000000 00000000`00000002 00000000`00000000 : clr!Thread::MakeStackwalkerCallback+0x2f
000007fe`eb0900f2 : 00000000`1fc3ab10 00000000`1fc3ab10 000000f8`00000000 00000000`00000002 : clr!Thread::StackWalkFramesEx+0x8d
000007fe`eb1de6e8 : 00000000`00000000 00000000`00000001 00000000`00000000 00000000`1fc3ab10 : clr!Thread::StackWalkFrames+0xb1
000007fe`eb13d831 : 00000000`00000000 00000000`2c92d880 00000000`00000000 00000000`023ade80 : clr!CNameSpace::GcScanRoots+0x1a4
000007fe`eb13de6d : 00000000`00000000 00000000`00000000 000007fe`eb8ea160 000007fe`eb13e6b6 : clr!WKS::gc_heap::mark_phase+0xe1
000007fe`eb29fcfd : 00000107`e77ad732 00000000`2c92d959 00000000`00000000 00000000`00000000 : clr!WKS::gc_heap::gc1+0xae
000007fe`eb13e44e : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : clr!WKS::gc_heap::garbage_collect+0x42e
000007fe`eb13d3fe : 00000000`00000030 00000000`00000002 000007fe`eb060000 00000000`00000000 : clr!WKS::GCHeap::GarbageCollectGeneration+0x14e
000007fe`eb13d12e : 00000000`0000001d 00000000`b2526b80 00000000`00000002 00000000`00000030 : clr!WKS::gc_heap::try_allocate_more_space+0x25f
000007fe`eb08d418 : 000007fe`e8b93028 00000000`00000030 00000000`00000002 00000000`2c92ddf0 : clr!FastAllocateObject+0x73e
Unable to load image C:\Windows\assembly\NativeImages_v4.0.30319_64\System.Core\84fa340f30d1921e0d8817f9344ee367\System.Core.ni.dll, Win32 error 0n2
*** WARNING: Unable to verify checksum for System.Core.ni.dll
000007fe`e8b25049 : 000007fe`e8b93028 00000000`b2526b00 00000000`b2526b00 000007fe`e8b24f9a : clr!JIT_NewFast+0xb8
Unable to load image C:\Windows\assembly\NativeImages_v4.0.30319_64\mscorlib\79d73b390cca60b8a1c1d1228c771f2f\mscorlib.ni.dll, Win32 error 0n2
*** WARNING: Unable to verify checksum for mscorlib.ni.dll
000007fe`ea18e458 : 00000000`b2526b00 00000000`00000004 00000000`b2526b00 00000000`00000000 : System_Core_ni+0x2a5049
000007fe`e8b37280 : 000007fe`e88849d8 000007fe`ea17efb0 00000000`0328be48 00000000`b24ee798 : mscorlib_ni+0x3ae458
000007ff`027790bc : 000007fe`e88dc738 00000000`b2524ed0 00000000`03137038 00000000`2c92dde8 : System_Core_ni+0x2b7280
000007ff`027789a0 : 00000000`00000000 00000000`00000000 000007ff`03b5ba48 00000000`067059f0 : 0x7ff`027790bc
...
000007fe`ea15169c : 00000000`04118280 00000000`0416ea50 00000000`0416ea50 00000000`00000000 : 0x7ff`024799c0
000007fe`ea1515ab : 00000000`0416ea50 00000000`00000000 00000000`00000000 000007fe`eb0a5a1f : mscorlib_ni+0x37169c
000007fe`ea1e6d8d : 00000000`0416ea50 00000000`00000000 00000000`00000000 00000000`00000000 : mscorlib_ni+0x3715ab
000007fe`eb09c9e4 : 00000000`0416ea78 00000000`00000000 00000000`00000000 00000000`00000000 : mscorlib_ni+0x406d8d
000007fe`eb09caf9 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : clr!CallDescrWorker+0x84
000007fe`eb09cb75 : 00000000`2c92ef48 00000000`00000001 00000000`2c92ef50 00000000`2c92f1a0 : clr!CallDescrWorkerWithHandler+0xa9
000007fe`eb09d0ac : 00000000`2c92f198 000007fe`ea0e4860 00000000`2c92f230 000007fe`e9eace7c : clr!MethodDesc::CallDescr+0x2a1
000007fe`eb16de50 : 00000000`2c92f640 00000000`2c92f220 00000000`2c92f6e0 000007fe`ea2bdf28 : clr!MethodDesc::CallTargetWorker+0x44
000007fe`eb1008e6 : 00000000`1fc3ab10 00000000`2c92f640 00000000`1fc3ab10 00000000`00001000 : clr!ThreadNative::KickOffThread_Worker+0x148
000007fe`eb10087b : 00000000`00000000 00000000`1fc3ab10 ffffffff`fffffffe 00000000`1fc3ab10 : clr!QueueUserWorkItemManagedCallback+0x92
000007fe`eb1007e8 : 00000000`00000480 00000000`004d0000 00000000`00000000 00000000`00000478 : clr!PEDecoder::CheckILOnlyImportDlls+0x294
000007fe`eb10094b : ffffffff`ffffffff 00000000`1fc3ab10 00000000`00000000 00000000`00000000 : clr!StubLinkerCPU::X86EmitPushReg+0x135
000007fe`eb16dca0 : 00000000`1fc3ab10 00000000`2c92fd80 00000000`00000001 00000000`00000000 : clr!COMArrayInfo::GetReference+0x12b
000007fe`eb22c736 : 00000000`28ef64b0 00000000`2c92f6f8 00000000`1fc3ab10 00000000`772cb98e : clr!ThreadNative::KickOffThread+0xc0
00000000`771959cd : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : clr!Thread::intermediateThreadProc+0x7d
00000000`772cb891 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : kernel32!BaseThreadInitThunk+0xd
00000000`00000000 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : ntdll!RtlUserThreadStart+0x1d
The other one says something different:
This dump file has an exception of interest stored in it.
The stored exception information can be accessed via .ecxr.
(1714.ed4): Access violation - code c0000005 (first/second chance not available)
clr!VirtualCallStubManager::ResolveWorker+0x4a3:
000007fe`edb827d8 488b8980000000 mov rcx,qword ptr [rcx+80h] ds:00000101`00000c84=????????????????
0:012> kb
RetAddr : Args to Child : Call Site
000007fe`edb82713 : 00000000`00e2f360 00000000`2c50e740 00000000`00000000 800003ff`825b7314 : clr!VirtualCallStubManager::ResolveWorker+0x4a3
000007fe`edb43585 : 00000000`00000003 000007ff`0442a6b0 00000000`a004a410 00000000`8bd38210 : clr!VirtualCallStubManager::ResolveWorkerStatic+0x213
000007ff`02b416fb : 00000000`b7476e68 00000000`2c50e910 00000000`00e2bfd0 00000000`00000000 : clr!ResolveWorkerAsmStub+0x95
000007ff`02b41344 : 00000000`0a48bc20 00000000`a0444b88 00000000`8bd37108 000007fe`edbe45e4 : 0x7ff`02b416fb
...
000007ff`0245b040 : 00000000`000000fa 00000000`04157e38 00000000`045a09d0 00000000`00000000 : 0x7ff`0245b3e4
Unable to load image C:\Windows\assembly\NativeImages_v4.0.30319_64\mscorlib\79d73b390cca60b8a1c1d1228c771f2f\mscorlib.ni.dll, Win32 error 0n2
*** WARNING: Unable to verify checksum for mscorlib.ni.dll
000007fe`ea7e169c : 00000000`0431c4a0 00000000`045a09d0 00000000`045a09d0 00000000`00000000 : 0x7ff`0245b040
000007fe`ea7e15ab : 00000000`045a09d0 00000000`00000000 00000000`00000000 000007fe`edb85a1f : mscorlib_ni+0x37169c
000007fe`ea876d8d : 00000000`045a09d0 00000000`00000000 00000000`00000000 00000000`00000000 : mscorlib_ni+0x3715ab
000007fe`edb7c9e4 : 00000000`045a09f8 00000000`00000000 00000000`00000000 00000000`00000000 : mscorlib_ni+0x406d8d
000007fe`edb7caf9 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : clr!CallDescrWorker+0x84
000007fe`edb7cb75 : 00000000`2c50f168 00000000`00000001 00000000`2c50f170 00000000`2c50f3c0 : clr!CallDescrWorkerWithHandler+0xa9
000007fe`edb7d0ac : 00000000`2c50f3b8 000007fe`ea774860 00000000`2c50f450 000007fe`ea53ce7c : clr!MethodDesc::CallDescr+0x2a1
000007fe`edc4de50 : 00000000`2c50f860 00000000`2c50f440 00000000`2c50f900 000007fe`ea94df28 : clr!MethodDesc::CallTargetWorker+0x44
000007fe`edbe08e6 : 00000000`1c696620 00000000`2c50f860 00000000`1c696620 00000000`00001000 : clr!ThreadNative::KickOffThread_Worker+0x148
000007fe`edbe087b : 00000000`00000000 00000000`1c696620 ffffffff`fffffffe 00000000`1c696620 : clr!QueueUserWorkItemManagedCallback+0x92
000007fe`edbe07e8 : 000007ff`fffdc000 00000000`00000002 00000000`00000002 000007fe`f51f7163 : clr!PEDecoder::CheckILOnlyImportDlls+0x294
000007fe`edbe094b : ffffffff`ffffffff 00000000`1c696620 00000000`00000000 00000000`00000000 : clr!StubLinkerCPU::X86EmitPushReg+0x135
000007fe`edc4dca0 : 00000000`1c696620 00000000`2c50ff20 00000000`00000001 00000000`00000000 : clr!COMArrayInfo::GetReference+0x12b
000007fe`edd0c736 : 00000000`21062e70 00000000`2c50f918 00000000`1c696620 00000000`00000000 : clr!ThreadNative::KickOffThread+0xc0
00000000`76eb59cd : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : clr!Thread::intermediateThreadProc+0x7d
00000000`770eb891 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : kernel32!BaseThreadInitThunk+0xd
00000000`00000000 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : ntdll!RtlUserThreadStart+0x1d
Also, the user is using .NET 4.0, but all developer's machines are using .NET 4.5 (but targets to .NET 4.0). That explains why the GC bug only happens on the user.