25

I have a computationally expensive, multi-threaded C# app that consistently crashes after 30-90 minutes of running. The error it gives is:

The runtime has encountered a fatal error. The address of the error was at 0xec37ebae, on thread 0xbcc. The error code is 0xc0000005. This error may be a bug in the CLR or in the unsafe or non-verifiable portions of user code. Common sources of this bug include user marshaling errors for COM-interop or PInvoke, which may corrupt the stack.

(0xc0000005 is the error-code for Access Violation)

My app does not invoke any native code, use any unsafe blocks, or even use any non-CLS-compliant types like uint. In fact, the line of code that the debugger says caused the crash is:

overallLength += distanceTravelled;

Both values are of type double.

Given all this, I believe the crash must be due to a bug in the compiler, the CLR, or the JIT. I'd like to figure out what causes it, or at the very least write a smaller reproduction to send to Microsoft, but I have no idea where to even begin. I've never had to view the CIL, the compiled JIT output, or the native stack trace (there is no managed stack trace at the time of the crash), so I'm not sure how. I can't even figure out how to view the state of all the variables at the time of the crash (VS unfortunately won't show them the way it does after managed exceptions, and writing them to the console or a file would slow the app down 1000-fold, which is obviously not an option).

So, how do I go about debugging this?


[Edit] Compiled with VS 2010 SP1, running the latest version of the .NET 4.0 Client Profile. Apparently it's ".Net 4.0C/.Net 4.0E, .Net CLR 1.1.4322".

BlueRaja - Danny Pflughoeft
  • Are you sure it's not a memory problem on your computer? I have seen the same error code happen in other programs when a pointer gets corrupted and set to invalid memory locations. – Matthew Sep 25 '12 at 20:13
  • @Matthew: I will run some memory tests, and try to reproduce it on another machine. But I am doubtful - the program is not very memory-intensive, and I am not having any issues with any other program. Also, it always crashes on the same line. – BlueRaja - Danny Pflughoeft Sep 25 '12 at 20:18
  • I'm not familiar with the CLR, but on the Sun JVM there's a flag that will log all methods that are compiled -- it might help if you can determine (1) whether the JIT was invoked, and (2) how long afterward the crash occurred. There's probably a tool that will let you inspect the memory map to find out what's loaded at that address (or even whether it's in a code block). – parsifal Sep 25 '12 at 20:23
  • Not at all an answer, but put in a line overallLength = overallLength + 0; Is it that call, that line number, or the line before...? – paparazzo Sep 25 '12 at 20:25
  • I've encountered weird issues like this; un-installing and re-installing .NET seemed to fix them. – Peter Ritchie Sep 25 '12 at 20:26
  • Attach the VS debugger in mixed mode; a GPF is a native exception, so you may get a better call stack. You may also try WinDbg to get the call stack, but I hope VS will be enough. – Alexei Levenkov Sep 25 '12 at 20:26
  • @AlexeiLevenkov: An answer detailing how to do those and similar tricks is exactly what I'm looking for. – BlueRaja - Danny Pflughoeft Sep 25 '12 at 20:28
  • @BlueRaja-DannyPflughoeft "An answer detailing how to do those and similar tricks" = a shopping list, which would be closed as not constructive. – casperOne Sep 25 '12 at 20:45
  • Please update your question to include the version of .NET Framework/OS/Visual Studio you are using – Jehof Oct 01 '12 at 06:07
  • @BlueRaja-DannyPflughoeft can you please respond to some of Jon's questions? E.g., have you been able to reproduce it on another machine? Can you reproduce it after updating everything on your current PC? Can you reproduce it readily (i.e., on demand, even though it takes 90 min)? If the answer to any of these is yes, then you should start your process via WinDbg, which will break before your app exits (from crashing) so you can see what's happening. – wal Oct 03 '12 at 12:31
  • http://blogs.msdn.com/b/dsvc/archive/2009/06/25/floating-point-exceptions-in-managed-code-resulting-in-access-violation-crash.aspx – NickD Oct 07 '12 at 19:07
  • @Snoopy: That's interesting, but I don't get those other exceptions, and I have no native code in my program. – BlueRaja - Danny Pflughoeft Oct 07 '12 at 20:13
  • Out of curiosity what values are in overallLength and distanceTravelled before they're added together? – Derek Tomes Oct 08 '12 at 00:38
  • @DerekTomes: I have no idea, the VS debugger won't tell me. – BlueRaja - Danny Pflughoeft Oct 08 '12 at 01:03
  • I know this is not your problem, but an FPU overflow could be the problem. – NickD Oct 08 '12 at 18:20

7 Answers

23

I'd like to figure out what causes it, or at the very least write a smaller reproduction to send into Microsoft, but I have no idea where to even begin.

"Smaller reproduction" definitely sounds like a great idea here... even if "smaller" won't mean "quicker to reproduce".

Before you even start, try to reproduce the error on another machine. If you can't reproduce it on another machine, that suggests a whole different set of tests to do - hardware, installation etc.

Also, check you're on the latest version of everything. It would be annoying to spend days debugging this (which is likely, I'm afraid) and then end up with a response of "Yes, we know about this - it was a bug in .NET 4 which was fixed in .NET 4.5" for example. If you can reproduce it on a variety of framework versions, that would be even better :)

Next, cut out everything you can in the program:

  • Does it have a user interface at all? If possible, remove that.
  • Does it use a database? See if you can remove all database access: definitely any output which isn't used later, and ideally input too. If you can hard-code the input within the app, that would be ideal, but if not, files are simpler for reproductions than database access.
  • Is it data-sensitive? Again, without knowing much about the app it's hard to know whether this is useful, but assuming it's processing a lot of data, can you use a binary search to find a relatively small amount of data which causes the problem?
  • Does it have to be multi-threaded? If you can remove all the threading, obviously that may well then take much longer to reproduce the problem - but does it still happen at all?
  • Try removing bits of business logic: if your app is componentized appropriately, you can probably fake out whole significant components by first creating a stub implementation, and then simply removing the calls.
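The binary-search idea in the data-sensitivity bullet can be sketched as follows. This is a minimal C# sketch under two assumptions: `crashesWith(n)` is a hypothetical callback you would write yourself (for example, one that launches the stripped-down repro in a fresh process on the first `n` records and reports whether it crashed before some timeout), and the failure is monotone, meaning that if some prefix of the data reproduces the crash, every longer prefix does too. A crash that takes 30-90 minutes to appear is likely to be somewhat random, so treat the result as a lead rather than a proof.

```csharp
using System;

static class ReproBisect
{
    // Returns the smallest record count n in [1, maxCount] for which
    // crashesWith(n) is true, assuming the predicate is monotone
    // (once a prefix reproduces the crash, every longer prefix does too).
    // Returns -1 if even the full input does not reproduce the crash.
    public static int SmallestFailingPrefix(int maxCount, Func<int, bool> crashesWith)
    {
        if (!crashesWith(maxCount)) return -1;

        int lo = 1, hi = maxCount; // invariant: crashesWith(hi) is true
        while (lo < hi)
        {
            int mid = lo + (hi - lo) / 2;
            if (crashesWith(mid))
                hi = mid;     // reproduced: try an even smaller prefix
            else
                lo = mid + 1; // not reproduced: more data is needed
        }
        return hi;
    }
}
```

Each probe is a full run of the repro, so with 1,000 records this needs roughly 11 runs (the initial check plus about log2(1000) probes).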

All of this will gradually reduce the size of the app until it's more manageable. At each step, you'll need to run the app again until it either crashes or you're convinced it won't crash. If you have a lot of machines available to you, that should help...
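For the threading bullet, one low-effort way to try the same workload with and without threads is a single switch between `Parallel.ForEach` and a plain `foreach`. A sketch, assuming the work can be expressed as a per-item `processItem` delegate; `Runner`, `RunAll`, and the `useThreads` flag are hypothetical names, and how the flag gets set (a command-line argument, a config value) is up to you.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

static class Runner
{
    // Runs processItem over every item, either through Parallel.ForEach
    // (useThreads == true) or one at a time on the calling thread
    // (useThreads == false), so the same workload can be tried both ways.
    // processItem must already be safe to call concurrently when useThreads is true.
    public static void RunAll<T>(IEnumerable<T> items, Action<T> processItem, bool useThreads)
    {
        if (useThreads)
        {
            Parallel.ForEach(items, processItem);
        }
        else
        {
            foreach (T item in items)
            {
                processItem(item);
            }
        }
    }
}
```

The sequential path uses no thread-pool threads at all, which is a cleaner test than lowering the degree of parallelism; expect it to take correspondingly longer to reproduce, if it reproduces at all.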

Jon Skeet
  • Just an update: it appears [ponsfonze found my problem](http://stackoverflow.com/a/14001081/238419). And of course, it's an issue that was fixed in .Net 4.5... I should have listened to you! – BlueRaja - Danny Pflughoeft Jan 02 '13 at 18:48
10

tl;dr: Make sure you're compiling to .NET 4.5.

This sounds suspiciously like the error described on MSDN. From the MSDN page:

This bug can be encountered when the Garbage Collector is freeing and compacting memory. The error can happen when the Concurrent Garbage Collection is enabled and a certain combination of foreground Garbage Collection and background Garbage Collection occurs. When this situation happens you will see the same call stack over and over. On the heap you will see one free object and before it ends you will see another free object corrupting the heap.

The fix is to compile to .NET 4.5. If for some reason you can't do this, you can instead disable concurrent garbage collection by setting gcConcurrent to false in the app.config file:

<configuration>
  <runtime>
    <gcConcurrent enabled="false"/>
  </runtime>
</configuration>

Or just compile to x86.
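To confirm that the workaround actually took effect, it can help to log a few runtime settings at startup. A minimal sketch: `RuntimeReport` is a hypothetical helper, while `GCSettings.IsServerGC`, `GCSettings.LatencyMode`, `Environment.Is64BitProcess`, and `Environment.Version` are existing .NET 4.0 APIs. With workstation GC, `LatencyMode` normally reports `Interactive` while concurrent GC is on and `Batch` once `<gcConcurrent enabled="false"/>` is in effect. Keep in mind that `Environment.Version` still reports 4.0.30319.x when .NET 4.5 is installed, because 4.5 is an in-place update of 4.0.

```csharp
using System;
using System.Runtime;

static class RuntimeReport
{
    // Summarizes the settings relevant to the workarounds above:
    // runtime version, process bitness (x86 vs x64), and GC mode.
    // With workstation GC, LatencyMode is normally Interactive when
    // concurrent GC is enabled and Batch when it has been disabled.
    public static string Describe()
    {
        return string.Format(
            "CLR {0}, {1}-bit process, server GC: {2}, latency mode: {3}",
            Environment.Version,
            Environment.Is64BitProcess ? 64 : 32,
            GCSettings.IsServerGC,
            GCSettings.LatencyMode);
    }
}
```

Calling `Console.WriteLine(RuntimeReport.Describe());` once at startup, with and without the `app.config` change or the x86 build, shows which configuration a given crash happened under.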

ponsfonze
6

Download Debug Diagnostic Tool v1.2, then:

  1. Run the program.
  2. Add a "Crash" rule.
  3. Select "Specific Process".
  4. On the Advanced Configuration page, set your exception if you know which exception it fails on, or just leave this page as is.
  5. Set the userdump location.

Now wait for the process to crash; DebugDiag will create a log file. Then open the Advanced Analysis tab, select Crash/Hang Analyzers in the top list and the dump file in the lower list, and click Start Analysis. This generates an HTML report for you. Hopefully you will find useful information in that report. If you have trouble with the analysis, upload the HTML report somewhere and post the URL here so we can focus on it.

psulek
4

My app does not invoke any native code, or use any unsafe blocks, or even any non-CLS compliant types like uint

You may think so, but threading and synchronization via semaphores, mutexes, and other handles are all native. .NET is a layer over the operating system; it does not implement multithreading in pure CLR code, because the OS already provides it.

Most likely this is a thread-synchronization error: multiple threads are probably trying to access a shared resource, such as a file, that lies outside the CLR boundary.

You may think you aren't accessing COM, but when you call certain APIs, such as getting the desktop folder path, the call goes through the shell's COM API.

You have the following two options:

  1. Publish your code so that we can review the bottleneck
  2. Redesign your app using the .NET parallel threading framework (the Task Parallel Library), which includes a variety of algorithms for CPU-intensive operations.
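If `overallLength` from the question is shared between threads, `overallLength += distanceTravelled` is an unsynchronized read-modify-write, which can silently lose updates. The Task Parallel Library's `Parallel.For` overload with `localInit`/`localFinally` lets each worker keep its own subtotal and merge it once at the end. A minimal sketch, assuming the distances are available up front as an array; `PathLength` and `TotalLength` are hypothetical names. It guards against lost updates; it does not by itself show that such a race explains the access violation.

```csharp
using System;
using System.Threading.Tasks;

static class PathLength
{
    // Sums distances in parallel without racing on a shared double.
    // Each worker accumulates into its own subtotal (localInit + body),
    // and subtotals are merged into the shared total under a lock (localFinally).
    public static double TotalLength(double[] distances)
    {
        double overallLength = 0.0;
        object sync = new object();

        Parallel.For(0, distances.Length,
            () => 0.0,                                           // localInit: per-worker subtotal
            (i, loopState, subtotal) => subtotal + distances[i], // body: touches no shared state
            subtotal =>                                          // localFinally: runs once per worker
            {
                lock (sync)
                {
                    overallLength += subtotal;
                }
            });

        return overallLength;
    }
}
```

The lock is taken once per worker rather than once per element, so it costs almost nothing; `Interlocked` has no `Add` for `double`, which is why a lock is used for the merge.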

Programs like this most likely fail after a certain period of time because collections grow and operations fail to complete before another thread interferes. Take the producer-consumer problem: you will not notice anything wrong until the producer becomes slower, or fails to finish its operation before the consumer kicks in.
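The producer-consumer point can be made concrete with a bounded `BlockingCollection<T>` (in `System.Collections.Concurrent` since .NET 4.0). This is an illustrative sketch, not the asker's code: `Pipeline.Run` and the summing workload are hypothetical stand-ins. The bounded capacity makes the producer wait when the consumer falls behind, so the queue cannot grow without limit, and `CompleteAdding` lets the consumer finish cleanly.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

static class Pipeline
{
    // Bounded producer/consumer: the producer blocks once `capacity` items
    // are queued, and the consumer's loop ends after CompleteAdding().
    public static double Run(int itemCount, int capacity)
    {
        using (var queue = new BlockingCollection<double>(capacity))
        {
            Task producer = Task.Factory.StartNew(() =>
            {
                try
                {
                    for (int i = 0; i < itemCount; i++)
                        queue.Add(i);        // blocks while the queue is full
                }
                finally
                {
                    queue.CompleteAdding();  // lets the consumer loop terminate
                }
            });

            double total = 0.0;
            foreach (double item in queue.GetConsumingEnumerable())
                total += item;               // only this thread touches total

            producer.Wait();                 // rethrows any producer exception
            return total;
        }
    }
}
```

Because only the consuming thread updates `total`, no lock is needed around the accumulation.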

Bugs in the CLR are rare, because the CLR is very stable, but poorly written code can make an error look like a CLR bug. The CLR cannot tell whether a bug is in your code or in the CLR itself.

Akash Kava
  • *"Redesign your app using .net parallel threading framework"* - That **is** what it uses.. – BlueRaja - Danny Pflughoeft Oct 01 '12 at 07:55
  • If you post some of your code's internals, we can offer more guidance. – Akash Kava Oct 01 '12 at 08:49
  • I would strongly suggest looking at the multi-threading issues first. It's likely this could cause memory corruption. This would actually give a good starting point for making a reproduction, create a multithreaded app that does a lot of your common calculations. – MikeKulls Oct 07 '12 at 23:16
1
  • Did you run a memory test on your machine? The one time I had comparable symptoms, one of my DIMMs turned out to be faulty (a very good memory tester is included in Win7; http://www.tomstricks.com/how-to-test-your-ram-or-memory-with-windows-memory-diagnostic-tool-in-windows-7/)

  • It might also be a heating/throttling issue if your CPU gets too hot after this period of time, although I'd expect that to happen sooner.

  • There should be a dump file that you can analyze. If you have never done this, find someone who has, or send it to Microsoft.

IvoTops
0

I suggest you open a support case via http://support.microsoft.com immediately, as the support engineers can show you how to collect the necessary information.

Generally speaking, as @paulsm4 and @psulek said, you can use WinDbg or DebugDiag to capture a crash dump of the process, which contains all the necessary information. However, if this is your first time using those tools, you might find them confusing. The Microsoft support team can provide step-by-step guidance, or even set up a Live Meeting session with you to capture the data, since the program crashes so often.

Once you are familiar with the tools, you can perform similar troubleshooting more easily in the future:

http://blogs.msdn.com/b/lexli/archive/2009/08/23/when-the-application-program-crashes-on-windows.aspx

BTW, it is too early to say "I've found a bug". Even if you can't see an obvious dependency on native code in your program, it may still have one. We should not draw conclusions before debugging further.

Lex Li