21

I have twice now seen a NullReferenceException logged from a Production ASP.NET MVC 4 web application - and logged on the wrong line. Not wrong by a line or two (like you would get with a PDB mismatch), but wrong by the length of the entire controller action. Example:

public ActionResult Index()
{
    var someObject = GetObjectFromService();
    if (someObject.SomeProperty == "X") { // NullReferenceException here if someObject == null
        // do something
    }

    // about 40 more lines of code

    return View();    // Stack trace shows NullReferenceException here
}

This has happened twice for actions on the same controller. The second case was logged on

// someObject is known non-null because of earlier dereferences
return someObject.OtherProperty
    ? RedirecToAction("ViewName", "ControllerName")
    : RedirectToAction("OtherView", "OtherController");

This is very disturbing. NullReferenceException is very easy to fix once you know what line it occurs on. It's not quite so easy if the exception could have happened anywhere within the controller action!

Has anyone ever seen anything like this at all, either in ASP.NET MVC or elsewhere? I'm willing to believe it's the difference between a Release build and a Debug build, but still, to be off by 40 lines?


EDIT:

To be clear: I'm the original author of "What is a NullReferenceException and how do I fix it?". I know what a NullReferenceException is. This question is about why could the stack trace be so far off. I have seen cases where a stack trace is off by a line or two due to PDB mismatch. I have seen cases where there is no PDB, so you don't get line numbers. But I have never seen a case where the stack trace is off by 32 lines.

EDIT 2:

Note that this has happened with two separate controller actions within the same controller. Their code is quite different from each other. In fact, in the first case, the NullReferenceException didn't even occur in a conditional - it was more like this:

SomeMethod(someObject.SomeProperty);

There was some chance that the code had been reorganized during optimization so that the actual NullReferenceException occurred closer to the return, and the PDB was in fact only off by a few lines. But I don't see an opportunity to rearrange a method call in a way that would cause the code to move by 32 lines. In fact, I just looked at the decompiled source, and it does not appear to have been rearranged.

What these two cases have in common are:

  1. They occur in the same controller (so far)
  2. In both cases, the stack trace points to the return statement, and in both cases, the NullReferenceException occurred 30 or more lines away from the return statement.

EDIT 3:

I just did an experiment - I just rebuilt the solution using the "Production" build configuration that we have deployed to our Production servers. I ran the solution on my local IIS, without changing the IIS configuration at all.

The stack trace showed the correct line number.

EDIT 4:

I don't know if this is relevant, but the circumstance causing the NullReferenceException is as unusual as this "wrong line number" issue itself. We appear to be losing session state for no good reason (no restarts or anything). That's not too strange. The strange part is that our Session_Start should be redirecting to the login page when that happens. Any attempt to reproduce the session loss causes the redirect to the login page. Subsequently using the browser "Back" button or manually entering the previous URL goes right back to the login page without hitting the controller in question.

So maybe two weird problems is really one very weird problem.

EDIT 5:

I was able to obtain the .PDB file, and to look at it with dia2dump. I thought it was possible that the PDB was messed up, and only had line 72 for the method. That's not the case. All the line numbers are present in the PDB.

EDIT 6:

For the record, this just happened again, in a third controller. The stack trace points directly to the return statement of a method. This return statement is simply return model;. I don't think there's any way for that to cause a NullReferenceException.

Edit 6a:

In fact, I just looked more closely at the log and found several exceptions which are not NullReferenceException, and which still have the stack trace point at the return statement. Both of these cases are in methods called from the controller action, not directly in the action method itself. One of these was an explicitly thrown InvalidOperationException, and one was a simple FormatException.


Here are a couple of facts I haven't thought relevant until now:

  1. The Application_Error in the global.asax is what causes these exceptions to be logged. It picks up the exceptions by using Server.GetLastError().
  2. The logging mechanism logs the message and stack trace separately (instead of logging ex.ToString(), which would have been my recommendation). In particular, the stack trace that I've been asking about comes from ex.StackTrace.
  3. The FormatException was raised in System.DateTime.Parse, called from System.Convert.ToDate, called from our code. The stack trace line that points to our code is the line that points to "return model;".
Community
  • 1
  • 1
John Saunders
  • 160,644
  • 26
  • 247
  • 397
  • Null reference exception is thrown at the point of invocation, not where your null object was created. So, by calling `View()` you are exposing your null object's invocation. – Brian Driscoll Nov 21 '14 at 21:58
  • 1
    Sometimes there could be logic in the view itself that has a null reference. Example a model property or may be model itself. Check your view to make sure nothing is in there that is failing. – Azhar Khorasany Nov 21 '14 at 21:59
  • 1
    I second Azhar's comment. My guess is that your view is calling something on your Model that is null. – Brian Driscoll Nov 21 '14 at 22:00
  • 2
    It is interesting to know why people voting down on "I should get exception here, but instead getting it there" question... – Alexei Levenkov Nov 21 '14 at 22:01
  • @AlexeiLevenkov Because that's not the case. – Brian Driscoll Nov 21 '14 at 22:01
  • 4
    @BrianDriscoll `View()` does not call `Execute()` on the view, so any NRE in the View would not be thrown in the controller's method. I'm inclined to think that the example code has been reduced to the point that it would never be able to reproduce the exception. As it stands this question is unanswerable because it does not have a [Minimal, Complete, and Verifiable example](http://stackoverflow.com/help/mcve). – Erik Philips Nov 21 '14 at 22:14
  • @Brian: `View()` returns a `ViewResult` object which is not evaluated during the body of the controller action. Also, I looked at the code of the `View()` method, and there is no way for it to throw a `NullReferenceException`. – John Saunders Nov 21 '14 at 22:20
  • @BrianDriscoll: also, I know where `NullReferenceException` is thrown. My question is about the stack trace, specifically about the line number shown in the stack trace. – John Saunders Nov 21 '14 at 22:21
  • @AzharKhorasany: the view is not evaluated inside the execution of the controller action. It's executed after the action returns. – John Saunders Nov 21 '14 at 22:25
  • this has never happened to me, it would be interesting to see what would cause something like this. If someObject is truly null I would think the if statement should throw nullreference exception. – Reza Shirazian Nov 21 '14 at 22:26
  • @ErikPhilips: the real code doesn't really matter. What matters is that a `NullReferenceException` is being throw on, say, line 40, but the stack trace in the log says line 72. – John Saunders Nov 21 '14 at 22:26
  • @RezaShirazian: I believe that the `if` statement threw the exception. The question is why does the stack trace show the `return` statement as the culprit. – John Saunders Nov 21 '14 at 22:30
  • @JohnSaunders what does visual studio do as it goes through the code? does it throw the exception on the view? how does it behave if there is a break point after the if statement but before the view? I would be interested to see how vs differs on its execution path than what you get on your stacktrace. – Reza Shirazian Nov 21 '14 at 22:37
  • @RezaShirazian: the strange line number comes up when the code is run in Production. When run in a Debug build, the log shows the `NullReferenceException` on the correct line number. Let me reiterate: I _know_ that the `NullReferenceException` is thrown on line 32. The question is why does the stack trace in my log file say that it happened on line **72**? – John Saunders Nov 21 '14 at 22:42
  • Is your controller by chance a partial class? I haven't tested this theory, but I can see where Visual Studio might be wrong if it is trying to merge another 40 line file with this one, but technically they are both the same class. – NightOwl888 Nov 22 '14 at 10:40
  • @NightOwl888: good thought, but no, it's not a partial class. Also, FYI, the line number is correct in a Debug build. – John Saunders Nov 22 '14 at 12:13
  • @Fka: thanks for the thought. Right now, I tend to doubt that optimization is the cause of this. Although I haven't yet looked at the IL code (I don't speak IL), I _did_ look at the decompiled code from the DLL. It showed only minor differences - nothing to account for 40 lines different in the line number. – John Saunders Nov 24 '14 at 17:47
  • @abatishchev: can you characterize the circumstances under which you see the similar behavior? I suspect this is something seen only very rarely, so any characterizations from the [so] community will likely help narrow it down. – John Saunders Nov 24 '14 at 17:48
  • Are you using some postcompilation steps like CodeContracts, PostSharp? These do decompile your dll and weave stuff into your dll while rewriting your dll which might point towards issues in the IL rewriters. – Alois Kraus Nov 24 '14 at 17:53
  • @John: Can't say from top of my head. I think it happens rarely because I get NullReferenceException rarely. I'll try to simulate some in my current MVC project. – abatishchev Nov 24 '14 at 17:55
  • @AloisKraus: no, we're not doing any post-compile stuff. Thanks for asking, though. Your question caused me to look at the build template and the .csproj file. Nothing "interesting", but I can now report that `Optimize` is `true` and `DebugType` is `pdbonly` for the Production build, but `false` and `full` for the `Debug` build. – John Saunders Nov 24 '14 at 18:23
  • As far as I know, contrary to what would happen in C++ (access violation only on member access, not method call), `NullReferenceException`s in the CLR are contractual by the specification, and are generated by `mov [ecx], ecx` instructions specially emitted by the JIT when it cannot prove some reference isn't null. Is it possible to inspect the method's assembly, post JIT? Maybe it's some wacky opcode reordering optimization. – DoomMuffins Nov 24 '14 at 18:45
  • Could the difference between Production and Debug builds be due to the debugger catching the exception as a first chance exception where it gets thrown, but he production builds do not include all the debug symbols and so the closest symbol that is included is the method name? – Evil Dog Pie Nov 24 '14 at 18:48
  • @MikeofSST: I get the correct line number when running either a Debug build or a Production build without the debugger, on my local machine. The strange line number comes from a Production build running on the Production servers. The fact that it's a Production build does not seem to be relevant, as I get the correct line number from a Production build on my local machine. – John Saunders Nov 24 '14 at 18:53
  • @DoomMuffins: it may come to that, if I'm given time to get down to looking at IL or even assembler. – John Saunders Nov 24 '14 at 18:54
  • You can't show us code to reproduce it? You can't track it? you can give the whole code of both methods at least. Sorry if I am ignorant here but I didn't read this 34 comments above mine. – LyingOnTheSky Nov 24 '14 at 19:53
  • 1
    I would try to get a user dump of the exception and then load into windbg with the symbols from the production server. Do you get the correct stack? – Mike Nov 24 '14 at 20:29
  • Also have you verified that the symbols are correct on the production server? – Mike Nov 24 '14 at 20:31
  • @LyingOnTheSky: the problem is that the code that caused the strange line number in Production is the exact same code that produced the _correct_ line number on my local machine. – John Saunders Nov 24 '14 at 20:41
  • @Mike: this is an ASP.NET MVC web application. I don't even know if it's even _possible_ to get a dump. How would I verify the symbols? – John Saunders Nov 24 '14 at 20:42
  • 3
    Use debug diags from msft. This will allow you to get the user dump when the access violation happens. Then you can load into windbg and see what stack you get. Tess' blog may help http://blogs.msdn.com/b/tess/ – Mike Nov 24 '14 at 20:56
  • @JohnSaunders Make sure that your exception and on the other machine is not completely different. I will just keep being around to see how it turns out. – LyingOnTheSky Nov 24 '14 at 20:57
  • You can also check the create time of the pdb file. It should match the compiled dll – Mike Nov 24 '14 at 20:59
  • 1
    @LyingOnTheSky: why I think it's not completely different: 1. It's not possible for those return statements to produce a `NullReferenceException`. 2. I reproduced the problem locally, got the correct line number, quickly fixed the problem, had QA test it, then deployed it to Production, where the problem no longer occurs. That suggests that I fixed the right problem, which is the problem that was reported locally at the correct line, but in Production on the wrong line. Also, the problem has not occurred again in Production. – John Saunders Nov 24 '14 at 21:00

4 Answers4

10

Can PDBs be off more than 2 or 3 lines?

You give the statement that you have never seen PDBs off more than a few lines. 40 lines seem to be too much, especially when the decompiled code doesn't look much of a difference.

However, that's not true and can be proven by a 2 liner: create a String object, set it to nulland call ToString(). Compile and run. Next, insert a 30 line comment, save the file, but do not recompile. Run the application again. The application still crashes, but gives a 30 lines difference in what it reports (line 14 vs. 44 in the screenshot).

It is not related at all to the code that gets compiled. Such things can easily happen:

  • code reformat, which e.g. sorts the methods by visibility, so the method moved up 40 lines
  • code reformat, which e.g. breaks long lines at 80 characters, usually this moves things down
  • optimize usings (R#) which removes 30 lines of unneeded imports, so the method moved up
  • insertion of comments or newlines
  • having switched to a branch while the deployed version (matching the PDB) is from trunk (or similar)

PDBs off by 30 lines

How can this happen in your case?

If it's really as you say and you seriously reviewed your code, there are two potential issues:

  • EXE or DLL does not match to PDBs, which can easily be checked
  • PDBs do not match to source code, which is harder to identify

Multithreading can set objects to null when you least expect it, even if it has been initialized before. In such a case, NullReferenceExceptions can not only be 40 lines away, it can even be in a totally different class and therefore file.

How to continue

Capture a dump

I'd first try to get a dump of the situation. This allows you to capture the state and look at everything in detail without the need of reproducing it on your developer machine.

For ASP.NET, see the MSDN blog Steps to Trigger a User Dump of a Process with DebugDiag when a Specific .net Exception is Thrown or Tess' blog.

In any case, always capture a dump including full memory. Also remember to collect all necessary files (SOS.dll and mscordacwks.dll) from the machine where the crash occured. You can use MscordacwksCollector (Disclaimer: I'm the author of it).

Check the symbols

See if the EXE/DLL really matches to your PDBs. In WinDbg, the following commands are helpful

!sym noisy
.reload /f
lm
!lmi <module>

Outside WinDbg, but still using debugging tools for Windows:

symchk /if <exe> /s <pdbdir> /av /od /pf

3rd party tool, ChkMatch:

chkmatch -c <exe> <pdb>

Check the source code

If PDBs match to DLLs, the next step is to check whether the source code belongs to the PDBs. This is best possible if you commit PDBs to version control together with the source code. If you did that, you can search the matching PDBs in source control and then get the same revision of source code and PDBs.

If you didn't do that, you're unlucky and you should probably not use source code but work with PDBs only. In case of .NET, this works out pretty well. I'm debugging a lot in 3rd party code with WinDbg without receiving the source code and I can get pretty far.

If you go with WinDbg, the following commands are useful (in this order)

.symfix c:\symbols
.loadby sos clr
!threads    
~#s
!clrstack
!pe

Why code is so important on StackOverflow

Also, I looked at the code of the View() method, and there is no way for it to throw a NullReferenceException

Well, other people have made similar statements before. It's easy to overlook something.

The following is a real world example, just minimized and in pseudo code. In the first version, the lock statement didn't exist yet and DoWork() could be called from several threads. Quite soon, the lock statement was introduced and everything went well. When leaving the lock, someobj will always be a valid object, right?

var someobj = new SomeObj(); 
private void OnButtonClick(...)
{
    DoWork();
}

var a = new object();   
private void DoWork()
{
    lock(a) {
        try {
            someobj.DoSomething();
            someobj = null;
            DoEvents();             
        }
        finally
        {
            someobj = new SomeObj();
        }
    }   
}

Until one user reported the same bug again. We were sure that the bug was fixed and this was impossible to happen. However, this was a "double-click user", i.e. someone who does a double click on anything that can be clicked.

The DoEvents() call, which was of course not in such a prominent place, caused the lock to be entered again by the same thread (which is legal). This time, someobj was null, causing a NullReferenceException in a place where it seemed impossible to be null.

That second time, it was return boolValue ? RedirectToAction("A1","C1") : RedirectToAction("A2", "C2"). The boolValue was an expression which could not have thrown the NullReferenceException

Why not? What is boolValue? A property with a getter and setter? Also consider the following (maybe a bit off) case, where RedirectToAction takes only constant parameters, looks like a method, throws an exception but is still not on the callstack. This is why it is so important to see code on StackOverflow...

Screenshot: method with constant parameters not on callstack

Thomas Weller
  • 55,411
  • 20
  • 125
  • 222
9

I have seen this kind of behavior in production code once. Although the details are a little vague (It was approximately 2 years ago, and although I can find the email, I don't have access to the code anymore, and neither the dumps etc.)

FYI, this is what I wrote to the team (very small parts from the large mail) -

// Code at TeamProvider.cs:line 34
Team securedTeam = TeamProvider.GetTeamByPath(teamPath); // Static method call.

"No way null reference exception can happen here."

Later, after more dump diving

"Findings -

  1. The issue was happening in DBI because it did not have root/BRH team. UI is not handling the null returned by CLib gracefully, and hence the exception.
  2. The stack trace shown on UI was misleading, and was due to the fact that Jitter and CPU can optimize / reorder instructions, causing stack traces to "lie".

Digging into a process dump revealed the issue, and has been confirmed that DBI indeed did not have above mentioned team."


I think, the thing to note here is the statement in bold above, in contrast with your analysis and statement -

"I just looked at the decompiled source, and it does not appear to have been rearranged.", or

"Production build running on my local machine shows the correct line number."

The idea is that optimizations can happen at different levels.. and those done at compile time are just some of them. Today, especially with Managed environment like .Net, there are actually relatively fewer optimizations done while emitting IL (Why should 10 compilers for 10 different .Net languages try to do same set of optimizations, when the emitted Intermediate Language code will be further transformed into machine code, either by ngen, or Jitter).

Hence what you have observed, can only be confirmed by looking at the jitted machine code (aka assembly) from a dump from the production machine.


One question I can see coming is - Why would Jitter emit different code on Production machine, compared to your machine, for the same build?

Answer - I don't know. I am not a Jit expert, but I do believe it can... because as I said above.. Today these things are way more sophisticated compared to technologies used 5-10 years ago. Who knows, what all factors.. like "memory, number of CPUs, CPU load, 32 bit vs 64 bit, Numa vs Non-Numa, Number of times a method has executed, how small or big a method is, Who calls it, what does it call, how many times, access patterns for memory locations, etc etc" does it look at while making these optimizations.

For your case, so far only you can reproduce it, and only you have access to your jitted code in production. Hence, (If I may say so :)) this is the best answer anyone can come up with.


EDIT: An important difference between the Jitter on one machine vs the other, can also be the version of the jitter itself. I'd imagine that as several patches and KBs are released for .net framework, who knows what differences of optimization behavior jitter with even minor version differences may have.

In other words, it is not sufficient to assume that both machines have same Major version of the framework (lets say .Net 4.5 SP1). Production may not have patches which are released every day, but your dev / private machine may have patch released last Tuesday.


EDIT 2: Proof of concept - i.e. Jitter optimizations can lead to lying stack traces.

Run following code yourself, Release build, x64, Optimizations on, all TRACE and DEBUG turned off, Visual Studio Hosting Process turned off. Compile from visual studio, but run from explorer. And try to guess which line the stack trace will tell you the exception is at?

class Program
{
    static void Main(string[] args)
    {
        string bar = ReturnMeNull();

        for (int i = 0; i < 100; i++)
        {
            Console.WriteLine(i);
        }

        for (int i = 0; i < bar.Length; i++)
        {
            Console.WriteLine(i);
        }

        Console.ReadLine();

        return;
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    static string ReturnMeNull()
    {
        return null;
    }
}

Unfortunately, after few attempts, I still cannot reproduce exact problem you have seen (i.e. error on return statement), because only you have access to exact code, and any specific code pattern it may have. Or, once again, it is some other Jitter optimization, which is not documented, and hence hard to guess.

Vikas Gupta
  • 4,455
  • 1
  • 20
  • 40
  • 1
    But the line numbers are original source code line numbers, and the exception is a managed exception. Why would the rearrangements made between IL and machine code have any effect on the relationship between the IL and the original source code line numbers? – bzlm Nov 27 '14 at 09:30
  • Because when the exception stack trace is built, runtime has to walk the stack to build the stack trace.. and what it has at runtime is stack in machine code, not IL code (**When machine code has been rearranged**, I don't think machine code can be traced back to IL that easily.. at least it's not required for the application to function correctly, and obviously, lack of that machine code to IL code mapping, is the kind of scenario where it bites). – Vikas Gupta Nov 27 '14 at 09:34
  • Vikas, thank you very much. This is a very interesting thought. One thing I'll have to check: our Production servers are load-balanced. I wonder if they all run the same hardware. Certainly, that hardware is different from my local machine (laptop). – John Saunders Nov 27 '14 at 18:10
3

Just a thought, but the one thing that I can think of is that perhaps there is a possibility that your build definition/configuration is pushing out an out of synch compiled version of your application dll(s) and this is why you see the discrepancy on your machine when you look up the line number from the stacktrace.

mreyeros
  • 4,359
  • 20
  • 24
  • 1
    I'll check the date, but this was the first deployment of this application to Production, so there cannot have been a previous version with a 40-line difference. – John Saunders Nov 24 '14 at 21:11
1

The problem and its symptoms smells of a hardware problem, eg:

We appear to be losing session state for no good reason (no restarts or anything).

If using InProc Session State Storage switch to out of process. This will help you isolate the problem of losing sessions from the symptom of mismatched PDB line numbers on the NRE you're reporting. If using out of process storage, run some diagnostic utilities on the server.

ps post the output from DebugDiag. I probably should have put this answer as a comment but there are too many already, need to space them out and comment on different diagnostic steps separately.

Jeremy Thompson
  • 61,933
  • 36
  • 195
  • 321
  • If you cannot provide a dump publicly then perhaps your best port of call is Microsoft PSS. Its a good service, they have the windbg experts and the private symbols too. Very likely to save you days from reading Tess's blog and a topic they would genuinely be interested in getting to the bottom of, good luck! – Jeremy Thompson Nov 29 '14 at 01:03
  • 1
    Jeremy, I'll suggest they try PSS. BTW, we are using SQL Server session state. I occasionally see exceptions about failure to connect to the SQL Server. Far fewer than the number of "we lost session state" exceptions, and not apparently correlated in time. Unfortunately, users have not been reporting problems, so this is not a high priority. – John Saunders Nov 29 '14 at 02:19