68

I am trying to debug some work that processes large files. The code itself works, but there are sporadic errors reported from the .NET Runtime itself. For context, the processing here is a 1.5GB file (loaded into memory once only) being processed and released in a loop, deliberately to try to reproduce this otherwise unpredictable error.

My test fragment is basically:

try {
    byte[] data =File.ReadAllBytes(path);
    for(int i = 0 ; i < 500 ; i++)
    {
        ProcessTheData(data); // deserialize and validate

        // force collection, for tidiness
        GC.Collect(GC.MaxGeneration, GCCollectionMode.Forced);
        GC.WaitForPendingFinalizers();
    }
} catch(Exception ex) {
    Console.WriteLine(ex.Message);
    // some more logging; StackTrace, recursive InnerException, etc
}

(with some timing and other stuff thrown in)

The loop will process fine for an non-deterministic number of iterations fully successfully - no problems whatsoever; then the process will terminate abruptly. The exception handler is not hit. The test does involve a lot of memory use, but it saw-tooths very nicely during each iteration (there is not an obvious memory leak, and I have plenty of headroom - 14GB unused primary memory at the worst point in the saw-tooth). The process is 64-bit.

The windows error-log contains 3 new entries, which (via exit code 80131506) suggest an Execution Engine error - a nasty little critter. A related answer, suggests a GC error, with a "fix" to disable concurrent GC; however this "fix" does not prevent the issue.

Clarification: this low-level error does not hit the CurrentDomain.UnhandledException event.

Clarification: the GC.Collect is there only to monitor the saw-toothing memory, to check for memory leaks and to keep things predictable; removing it does not make the problem go away: it just makes it keep more memory between iterations, and makes the dmp files bigger ;p

By adding more console tracing, I have observed it faulting during each of:

  • during deserialization (lots of allocations, etc)
  • during GC (between a GC "approach" and a GC "complete", using the GC notification API)
  • during validation (just foreach over some of the data) - curiously just after a GC "complete" during the validation

So lots of different scenarios.

I can obtain crash-dump (dmp) files; how can I investigate this further, to see what the system is doing when it fails so spectacularly?

Community
  • 1
  • 1
Marc Gravell
  • 1,026,079
  • 266
  • 2,566
  • 2,900
  • 4
    Curious why you explicitly call the GC, as there are very few situations where that can be considered good practice. Given your rep I'm sure you have a good reason and curious what it is. – Eric J. Jan 09 '13 at 15:31
  • 3
    @EricJ this is not meant as production code; that GC collect is merely intended to get things into a known state for each iteration, rather than GC randomly in the middle. Removing it doesn't fix the error: it just makes it harder to watch the saw-tooth ;p This entire block of code exists **purely** to stress test this, to repro a reported error. – Marc Gravell Jan 09 '13 at 15:34
  • What does the method ProcessTheData(data) do / look like..? also you should not have to call GC at all also where is that code running..? is it on your local or a remote machine..? start with some of the basic's 1. check to make sure you have proper try{}catch error handling 2. check to make sure that things such as Region settings are consistent 3 put logging in in places where you think things may error.. – MethodMan Jan 09 '13 at 15:35
  • Can you work out where in your code the application is when it terminates? – Rawling Jan 09 '13 at 15:36
  • Have you taken a memory dump w/ something like sysinternals procdump prior to the crash then taken a look w/ windbg? It seems like the LOH may be extremly fragmented from all of these large array allocations, and I'm wondering if that may lead to some sort of memory corruption. The exception code 0xc0000005 also points to a memory access issue... just a possible next step. – Chris Jan 09 '13 at 15:37
  • what if you use .net framework source stepping + disable "justmycode" + use intellitrace events AND call information + set break when exception is to catch all exceptions? – Nahum Jan 09 '13 at 15:42
  • What is the range of successful iterations? Does it usually fail after roughly the same number, or is it all over the place? – Jon B Jan 09 '13 at 15:42
  • In order to rule out an hardware issue, did you try running a memory test tool like memtest86 : http://www.memtest.org/#downiso and CPU stress test ( http://superuser.com/questions/396501/how-can-i-produce-highcpu-load-on-windows ) ? – mathieu Jan 09 '13 at 15:46
  • 1
    Strange question - do you get the same issues if you right click -> run as administrator? It was suggested on a blog I found, written by someone trying to debug the same problem ([link](http://www.jamesewelch.com/2010/09/30/troubleshooting-internal-error-in-the-net-runtime/) if you're interested). – Bridge Jan 09 '13 at 15:48
  • @Rawling I can add a bit more logging, but: only a little – Marc Gravell Jan 09 '13 at 15:49
  • @EricJ. oddly enough, another answer on a related questions suggests this is GC related; I've added some logging, and I'm *hoping* it will turn out to be happening during the GC step. That would be nice and conclusive. – Marc Gravell Jan 09 '13 at 15:54
  • @NahumLitvin there **is** no exception; the execution engine itself terminates – Marc Gravell Jan 09 '13 at 15:54
  • @JonB all over the place; sometimes it'll crash in under 5 iterations; sometimes it'll run for "frustratingly long" happily – Marc Gravell Jan 09 '13 at 15:55
  • 16
    Not sure if relevant, but according to [MSDN](http://msdn.microsoft.com/en-us/library/system.executionengineexception.aspx), the garbage collector can produce this error under heavy load: `In some cases, an application that targets the .NET Framework may throw an ExecutionEngineException exception during garbage collection when an application or the system on which it is running is under a heavy load. As a workaround, you can disable concurrent garbage collection by modifying the application's configuration file. For more information, see How to: Disable Concurrent Garbage Collection.` – Bridge Jan 09 '13 at 16:00
  • @mathieu re memory: this is me investigating something that was originally reported on a completely separate machine. I have tested the memory recently, although not today, but I think that is unlikely – Marc Gravell Jan 09 '13 at 16:00
  • @Bridge yes, that's the theory (found after posting the question) that I'm investigating currently – Marc Gravell Jan 09 '13 at 16:00
  • That's the one. Fixed in 4.5 – Hans Passant Jan 09 '13 at 16:50
  • @Hans damn: 'cos I reproduced it with 4.5 installed (targetting both 4.0 and 4.5) – Marc Gravell Jan 09 '13 at 18:26
  • Damn indeed. Given the randomness and the lack of evidence that 4.5 has a buggy collector, you really ought to consider a hardware problem. Make sure you can repro this on another machine first. – Hans Passant Jan 10 '13 at 11:30
  • @Hans that's just it: it was referred to me from someone else, so definitely reproducible. I will run a fresh memtest later, though. – Marc Gravell Jan 10 '13 at 11:46
  • Well, that's pretty convincing. Can you characterize the code a bit better? How much hackorama is in there? I know you like to squeeze the last ounce of perf out of serialization. Doing anything with the Marshal class, fixed buffers or stackalloc? – Hans Passant Jan 10 '13 at 12:08
  • @HansPassant none of the above; it uses `unsafe` **only** to whack between an Int32<===>Single, of a local by-val parameter; nothing scary. And the error can occur **outside** of deserialization - just in the validation step (the MoveNext of a collection, strangely; but another time in the EnsureCapacity of a collection). Either the CLI is screwy, or `SortedList<,>` is doing something evil! – Marc Gravell Jan 10 '13 at 12:11
  • The code has to operate in the bowels of the CLR to generate a FEEE. EnsureCapacity() is a good one, very likely to trigger a GC. MoveNext() is not, could only happen when background GC is enabled and you already eliminated that. You've got a repro scenario, I'd just take this to Microsoft Support. – Hans Passant Jan 10 '13 at 13:20
  • @Hans the MoveNext was a slightly different scenario: I've only seen that **once**, and it was an AccessViolationException; the single time (out of many) that I've seen an actual .NET exception - basically, memory corruption presumably due to mad stuff happening during GC. – Marc Gravell Jan 10 '13 at 13:23
  • Okay, that makes more sense. Strong hints that the GC is not properly updating an object reference after it compacts the heap. Generates a FEEE when that bad reference is encountered during a GC, an AV if you dereference the object reference in your code after the collection. Sure sounds like the GC still has cooties. – Hans Passant Jan 10 '13 at 13:31
  • @Hans I'll see what colleague-Nick can make of one my more-recent dmp files, then I'll try logging via MS – Marc Gravell Jan 10 '13 at 13:33
  • Try attaching to the process in native mode. The Attach dialog allows to choose the mode. Then set the debugger to stop at all exceptions.; Anyway, this is obviously a CLR bug as your use of unsafe code is clearly safe. I think the best you can do is to bisect the code triggering this bug and working around it. – usr Jan 10 '13 at 14:12
  • @usr problem is: it is happenig in unpredictable and unrelated areas: no single cause, other than lots of memory usage at the time – Marc Gravell Jan 10 '13 at 15:54
  • Have you seen if compiling and running under the Mono runtime does the same? Just might be interesting to see what the contrasts in results come up. – Jesse C. Slicer Jan 15 '13 at 21:24
  • By chance, does the event log have `RADAR_PRE_LEAK_64` as the event name? Also, are all of your assemblies .NET 4.0, or do some of them target the 2.0 runtime? – Christopher Currens Jan 15 '13 at 22:14
  • @ChristopherCurrens I don't remember seeing that name; one targets .NET 3; why? – Marc Gravell Jan 16 '13 at 07:25
  • I've had intermittent program failures, not even triggering WER, with that event name before. Happened on several computers quite randomly when there's heavy CPU usage. My question about the 4.0 vs 2.0 runtime, can be ignored. I was thinking of something that only affects mixed-mode assemblies. You mentioned earlier about reproducing it with 4.5 installed. Have you tested it with .NET 4.0 _only_, that is targeting 4.0 and not having 4.5 installed at all (since it's an in-place update)? – Christopher Currens Jan 16 '13 at 18:42
  • @ChristopherCurrens no, I confess I have not. Due to being in-place, it is regrettably a pain to do... – Marc Gravell Jan 16 '13 at 20:26
  • 4
    Have you managed to figure out what was causing this? – Dan Is Fiddling By Firelight Apr 08 '13 at 14:45
  • Almost two years later, did you managed to fix it ? – Hossein Sep 19 '14 at 03:06
  • @Hossein nope, I never did – Marc Gravell Sep 19 '14 at 07:08
  • hmmm might it be simple ram error? a few month ago I had a problem with my system crushing around aloccating 12GIG of ram it was simply hitting a bad memory sector. and after random amount of time crushing.. – Nahum Sep 28 '14 at 04:51
  • 2
    @Nahum when I asked this, it was via a support email to one of my open source projects that I could reproduce locally. It seems unlikely that we would have exactly the same RAM fault. – Marc Gravell Sep 28 '14 at 07:33

5 Answers5

23

If you have memory dumps, I'd suggest using WinDbg to look at them, assuming that you're not doing that already.

Trying running the comment !EEStack (mixed native and managed stack trace), and see if there's anything that might jump out in the stack trace. In my test program, I found this one of the times as my stack trace where a FEEE happened (I was purposefully corrupting the heap):

0:000> !EEStack
---------------------------------------------
Thread   0
Current frame: ntdll!NtWaitForSingleObject+0xa
Child-SP         RetAddr          Caller, Callee
00000089879bd3d0 000007fc586610ea KERNELBASE!WaitForSingleObjectEx+0x92, calling ntdll!NtWaitForSingleObject
00000089879bd400 000007fc5869811c KERNELBASE!RaiseException+0x68, calling ntdll!RtlRaiseException
[...]
00000089879bec80 000007fc49109cf6 clr!WKS::gc_heap::gc1+0x96, calling clr!WKS::gc_heap::mark_phase
00000089879becd0 000007fc49109c21 clr!WKS::gc_heap::garbage_collect+0x222, calling clr!WKS::gc_heap::gc1
00000089879bed10 000007fc491092f1 clr!WKS::GCHeap::RestartEE+0xa2, calling clr!Thread::ResumeRuntime
00000089879bed60 000007fc4910998d clr!WKS::GCHeap::GarbageCollectGeneration+0xdd, calling clr!WKS::gc_heap::garbage_collect
00000089879bedb0 000007fc4910df9c clr!WKS::GCHeap::Alloc+0x31b, calling clr!WKS::GCHeap::GarbageCollectGeneration
00000089879bee00 000007fc48ff82e1 clr!JIT_NewArr1+0x481

Since this could be related to heap corruption from the garbage collector, I would try the !VerifyHeap command. At least you could make sure that the heap is intact (and your problem lies elsewhere) or discover that your issue might actually be with the GC or some P/Invoke routines corrupting it.

If you find that the heap is corrupt, I might try and discover how much of the heap is corrupted, which you might be able to do via !HeapStat. That might just show the entire heap corrupt from a certain point, though.

It's difficult to suggest any other methods to analyze this via WinDbg, since I have no real clue about what your code is doing or how it's structured.

I suppose if you find it to be an issue with the heap and thus meaning it could be GC weirdness, I would look at the CLR GC events in Event Tracing for Windows.


If the minidumps you're getting aren't cutting it and you're using Windows 7/2008R2 or later, you can use Global Flags (gflags.exe) to attach a debugger when the process terminates without an exception, if you're not getting a WER notification.

In the Silent Process Exit tab, enter the name of the executable, not the full path to it (ie. TestProgram.exe). Use the following settings:

  • Check Enable Silent Process Exit Monitoring
  • Check Launch Monitor Process
  • For the Monitor Process, use {path to debugging tools}\cdb.exe -server tcp:port=5005 -g -G -p %e.

And apply the settings.

When your test program crashes, cdb will attach and wait for you to connect to it. Start WinDbg, type Ctrl+R, and use the connection string: tcp:port=5005,server=localhost.

You might be able to skip using remote debugging and instead use {path to debugging tools}\windbg.exe %e. However, the reason I suggested remote instead, was because WerFault.exe, which I believe is what reads the registry and launches the monitor process, will start the debugger in Session 0.

You can make session 0 interactive and connect to the window station, but I can't remember how that's done. It's also inconvenient, because you'd have to switch back and forth between sessions if you need to access any of your existing windows you've had open.

Bridge
  • 29,818
  • 9
  • 60
  • 82
Christopher Currens
  • 29,917
  • 5
  • 57
  • 77
7

Tools->Debugging->General->Enable .Net Framework Debugging

+

Tools->IntelliTace-> IntelliTaceEbents And Call Information

+

Tools->IntelliTace-> Set StorIntelliTace Recordings in this directory

and choose a directory

should allow you to step INTO .net code and trace every single function call. I tried it on a small sample project and it works

after each debug session it suppose to create a recording of the debug session. it the set directory even if CLR dies if im not mistaken

this should allow you to get to the extact call before CLR collapsed.

Nahum
  • 6,959
  • 12
  • 48
  • 69
  • 1
    doing work that involves 10+GB of memory and takes over a minute per iteration, and might not happen for ages, that might be excessive amounts of logging. Good idea, though. – Marc Gravell Jan 09 '13 at 16:35
3

Try writing a generic exception handler and see if there is an unhandled exception killing your app.

    AppDomain currentDomain = AppDomain.CurrentDomain;
    currentDomain.UnhandledException += new UnhandledExceptionEventHandler(MyExceptionHandler);

static void MyExceptionHandler(object sender, UnhandledExceptionEventArgs e) {
        Console.WriteLine(e.ExceptionObject.ToString());
        Console.WriteLine("Press Enter to continue");
        Console.ReadLine();
        Environment.Exit(1);
Dhawalk
  • 1,257
  • 13
  • 28
  • 2
    I would expect he's already tried this. He states: "The exception handler is not hit." in the question. – ChrisF Jan 09 '13 at 15:38
  • 9
    Alas, this is a lower-level "exception" - 80131506 is an ExecutionEngineException; after that, **no** managed code will run. Good idea, but doesn't work. – Marc Gravell Jan 09 '13 at 15:39
  • I assumed his exception handler is the catch block he has written around the loop. – Dhawalk Jan 09 '13 at 15:39
  • 1
    To be explicit: yes I tried this; no that also does not get hit – Marc Gravell Jan 09 '13 at 15:40
  • 2
    afaik the ExecutionEngineException leads to an immediate process termination since .NET 4.0, so this unfortunately wont be helpfull. – Carsten Jan 09 '13 at 15:41
  • @MarcGravell: Have you tried hosting the CLR using native code, then handling the exception there? – user541686 Jan 10 '13 at 08:17
  • @MarcGravell: Haha yeah... I've never done it myself so I don't know how hard it is lol. Just mentioned it since it might help catch exceptions in native code. :) – user541686 Jan 10 '13 at 08:59
3

I usually invesitgate memory related problems with Valgrind and gdb.

If you run your things on Windows, there are plenty of good alternatives such as verysleepy for callgrind as suggested here:
Is there a good Valgrind substitute for Windows?

If you really want to debug internal errors of the .NET runtime, you have the problem that there is no source for neither the class libraries nor the VM.

Since you can't debug what you don't have, I suggest that (apart from decompiling the .NET framework libraries in question with ILSpy, and adding them to your project, which still doesn't cover the vm) you could use the mono runtime.
There you have both the source of the class libraries as well as of the VM.
Maybe your program works fine with mono, then your problem would be solved, at least as long as it's only a one-time-processing task.

If not, there is an extensive FAQ on debugging, including GDB support
http://www.mono-project.com/Debugging

Miguel also has this post regarding valgrind support:
http://tirania.org/blog/archive/2007/Jun-29.html

In addition to that, if you let it run on Linux, you can also use strace, to see what's going on in the syscalls. If you don't have extensive winforms usage or WinAPI calls, .NET programs usually work fine on Linux (for problems regarding file system case-sensitivity, you can loopmount a case-insensitive file system and/or use MONO_IOMAP).

If you're Windows centric person, this post says the closest thing Windows has is WinDbg's Logger.exe, but ltrace information is not as extensive.

Mono sourcecode is available here:
http://download.mono-project.com/sources/

You are probably interested in the sources of the latest mono version
http://download.mono-project.com/sources/mono/mono-3.0.3.tar.bz2

If you need framework 4.5, you'll need mono 3, you can find precompiled packages here
https://www.meebey.net/posts/mono_3.0_preview_debian_ubuntu_packages/

If you want to make changes to the sourcecode, this is how to compile it:
http://ubuntuforums.org/showthread.php?t=1591370

Community
  • 1
  • 1
Stefan Steiger
  • 78,642
  • 66
  • 377
  • 442
1

There are .NET exceptions which can not be caught. Check out: http://msdn.microsoft.com/en-us/magazine/dd419661.aspx.

Bridge
  • 29,818
  • 9
  • 60
  • 82
Dalibor Čarapić
  • 2,792
  • 23
  • 38