How do I troubleshoot an illegal memory access crash that only occurs on a client's system?

Question

I am attempting to troubleshoot a problem with our application that only occurs on a particular server belonging to one of our customers.

The application sometimes crashes, and the core files show an illegal memory access. I suspect the cause is a failure in the malloc function. It is probably returning a NULL pointer, even though the machine still has plenty of free memory when this occurs. My theory is that the memory is too fragmented, so an attempt to allocate a larger block (18MB) may fail.

What steps can I take to troubleshoot this problem? For example, does Windows log any information when a memory allocation fails? Or does it just ignore it?

The server in question is running Windows Server 2008 R2 and the Windows Event Log service is running.

At this point I can't include any code, because I don't know what part of the application is causing the problem. How can I narrow this down?

There is no log event for a failed memory allocation. You don't even know whether `malloc()` is performing an OS-level allocation to begin with; it could be using a private internal memory pool instead and failing if the pool does not have adequate memory. — Remy Lebeau, Dec 02 '15 at 19:15
`malloc` does not normally go to Windows, which assigns an amount of memory on the program's heap when the program is executed. 18MB is not a very large amount of memory. Please post the [MCVE](http://stackoverflow.com/help/mcve) which illustrates the fault. Memory allocation faults are almost always caused by a coding error. — Weather Vane, Dec 02 '15 at 19:15
Are you testing the return for NULL whenever you call malloc? — stark, Dec 02 '15 at 19:22
It actually allocates and deallocates a lot of memory chunks with sizes that range from 1MB to 18MB. That is why my suspicion is more related to memory fragmentation than to actually running out of free memory. — Camiel Coppelmans, Dec 02 '15 at 19:25
'probably returning NULL pointer' - are you saying you do not check the result of `malloc()`? Also, is it C or C++? — SergeyA, Dec 02 '15 at 19:26
'It is **probably** returning a NULL pointer, but the machine still have plenty of free memory.' Why don't you just check the return value of malloc in your code and perform the logging yourself? That way you can be sure that it was a failing malloc that caused the crash and proceed to investigate why. — JonatanE, Dec 02 '15 at 19:28
The code is not checking the return value of the malloc call; I will certainly add this to the code. The problem is that it only happens on one particular server deployed at one of our customers, so it is difficult to try new versions of the application. It is old code (about 10 years old) from a very widely deployed system. We have never seen this crash before, except on this one server. — Camiel Coppelmans, Dec 02 '15 at 19:36
If you've got a memory dump, you should be able to at a minimum distinguish between a null pointer dereference and other sorts of memory access errors. I think you should also be able to tell whether the problem is occurring in your code - under the circumstances it is just about as likely to be in third-party code that is being loaded into your process for one reason or another. — Harry Johnston, Dec 02 '15 at 23:43
You say "This is not a question about code errors", but I believe it is. To directly answer your question: no, Windows does not log failed memory allocations, because it is not the OS doing the allocation but the C standard library. However, I believe the real reason you're asking this question is that the program is not allocating/freeing memory properly. First make sure that the return values of `malloc`/`calloc`/`realloc` are checked each and every time. Then use a memory checking tool to find out where it's misbehaving. — dbush, Dec 03 '15 at 15:06
See [this link](http://stackoverflow.com/questions/413477/is-there-a-good-valgrind-substitute-for-windows) for some memory checking tools for Windows. — dbush, Dec 03 '15 at 15:06
I've rewritten the question to be better suited for SO, and voted to reopen. (It might still be considered too broad in which case it probably won't attract enough reopen votes.) You can revert my edit if you wish but the literal answer to the question as originally posted is just "no" so there's probably not much point. :-) — Harry Johnston, Dec 03 '15 at 20:34
To answer the question, "How can I narrow this down", the simplest thing is a stack dump. If the app is unmanaged you can open the dump file with Windbg and enter `kv` (or one of its many variants). The second thing is to disassemble the code before the faulting instruction: you can use `ub` or `u eip-40`. If the app is managed you can use the sos debug extension. I googled just now and got a number of hits on SO. — Χpẘ, Dec 03 '15 at 22:27
This is one of the most difficult situations in the industry. Ideally you want to be able to go on-site, and bring a laptop with you with an array of debug tools (e.g. debug version of your code that traces all allocations), or even step through your code in the debugger on the system in question. — M.M, Dec 03 '15 at 23:13

Christopher Pisz · Answer 1 · 2015-12-03T23:09:54.210

No. The Windows Event Log is something you'd have to set up and use in your code, usually for a Windows service.

Please show a code snippet demonstrating how you are allocating memory.

It should be standard practice to error check every call to malloc by checking whether the value returned is NULL.
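As a rough sketch of that check, the wrapper below logs the requested size and the call site whenever an allocation fails. The names `checked_malloc` and `CHECKED_MALLOC` are hypothetical, and writing to `stderr` is an assumption; a service would more likely write to its own log file. The `%zu` format needs a C99-capable runtime.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper, not part of the original application: wraps malloc()
   and reports the size and call site of any request that fails. */
static void *checked_malloc(size_t size, const char *file, int line)
{
    void *p;

    assert(file != NULL);
    p = malloc(size);
    if (p == NULL) {
        /* stderr is an assumption; substitute the application's own logger. */
        fprintf(stderr, "%s:%d: malloc(%zu) failed\n", file, line, size);
    }
    return p; /* the caller must still handle NULL */
}

/* Fills in the call site automatically. */
#define CHECKED_MALLOC(size) checked_malloc((size), __FILE__, __LINE__)
```

Replacing a suspect `malloc(n)` with `CHECKED_MALLOC(n)` keeps the call's behaviour the same but leaves a trace that either confirms or rules out the failed-allocation theory.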

If you are not sure which call to malloc is failing, your best bet is to invest in a good profiler. I've had a good amount of success using Intel Parallel Studio, but it isn't cheap. Also keep in mind that every profiler I've ever tried fails to work across COM boundaries.

"Illegal memory access" is not necessarily a failure to allocate memory. It could be all manner of things. You need to break down your software into testable units and pinpoint the problem before worrying about how to resolve it.

Edit (after question revision): You are really limiting yourself with the constraint "We cannot alter the code".

You should begin by searching every file for malloc or new and ensuring the result is checked.

You also have the option of turning optimization off, exporting symbols, creating a build, installing debugging tools, and remotely debugging while stepping through the code to narrow down where the trouble is. However, that is probably only an option if you have a list of steps to reproduce. Usually these kinds of memory problems are random in nature due to a bug in the code. It could just show up in one build and not another, but the bug is still there.

You can also profile remotely, but profiling an entire application or service yields nearly unusable results. Software should be broken down into unit-testable parts and in turn into integration-testable parts. If it was not, this is the price you pay (even if it wasn't your fault).

score 1 · Answer 2 · answered Dec 03 '15 at 05:08

This is a classic debug situation. You have an immediate failure (illegal memory access) and you need to work back to a root cause. If you were debugging in assembly you'd most likely see a register with an invalid memory ptr that was being used to access memory. After identifying the register you'd work your way backward to see where that register got its value from.
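One quick check on that register value can be sketched in C. If the base pointer was NULL, the faulting address is just the offset being accessed, so it tends to be very small. The 64 KB threshold below is an assumption based on the low range of the address space that Windows keeps unmapped, and both function names are hypothetical. An access far into an 18MB buffer from a NULL base would land above that threshold, so a negative result does not rule out a NULL base.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed threshold: addresses below 64 KB are treated as "NULL plus a
   small offset". This is a heuristic, not a guarantee. */
#define NULL_REGION_LIMIT ((uintptr_t)0x10000)

/* Returns nonzero if the faulting address looks like a NULL-based access. */
static int looks_like_null_deref(uintptr_t fault_address)
{
    return fault_address < NULL_REGION_LIMIT;
}

/* Address that element `index` would occupy in an array starting at `base`. */
static uintptr_t element_address(uintptr_t base, size_t index, size_t elem_size)
{
    assert(elem_size > 0);
    return base + (uintptr_t)(index * elem_size);
}
```

For example, reading element 16 of an `int` array whose base is NULL touches address 64, which this heuristic flags; the same access through a valid heap pointer would not be flagged.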

If it got its value from a memory allocation call then your theory might be right.
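The theory in the question is that fragmentation, rather than exhaustion, makes an 18MB request fail. A toy model of that effect, assuming a made-up address space tracked in 1MB blocks (all names here are hypothetical and this is not how the real heap is implemented):

```c
#include <assert.h>
#include <stddef.h>

/* used[i] != 0 means 1 MB block i is allocated. */

/* Total number of free blocks, regardless of where they are. */
static size_t total_free(const unsigned char *used, size_t n)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if (!used[i]) {
            count++;
        }
    }
    return count;
}

/* Length of the longest run of adjacent free blocks. */
static size_t largest_free_run(const unsigned char *used, size_t n)
{
    size_t best = 0, run = 0;
    for (size_t i = 0; i < n; i++) {
        run = used[i] ? 0 : run + 1;
        if (run > best) {
            best = run;
        }
    }
    return best;
}

/* A contiguous request succeeds only if a long enough free run exists. */
static int can_allocate(const unsigned char *used, size_t n, size_t want)
{
    assert(want > 0);
    return largest_free_run(used, n) >= want;
}
```

With a 64-block space in which one block out of every 16 stays allocated, 60MB is free in total, yet the longest free run is 15MB and an 18MB request cannot be satisfied. Comparing the largest free region with the total free memory in the real process is what would confirm or refute the theory.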

If it got its value from another register or memory location for which you don't know what its value should be, then you have an "intermediate cause". The second register or memory location was the cause of the first register having an invalid memory ptr.

You keep working your way back through intermediate causes until you find the root cause - something that is broken that someone could fix. You may have to go up the call stack a long way to find the next intermediate cause or go a long way back in a particular function. If you're unlucky the root cause may be a memory overwrite or a race condition or something else that gets in the way of what otherwise is mostly a deductive process.

If you can do source debugging (probably not if you have a third party app), you can avoid dealing with assembly language.

BTW, if you do have a third party app, chances are good there won't be anything you can do to fix the problem on your own, even if you do find a root cause. You'll likely need an update from the software vendor.

If the software is open source you do have more options. You can download the source, fix errors, and rebuild. Or you can push a fix back into the OSS project.

(I've now rewritten the question to include that information and to focus on the more answerable aspect. If it gets reopened you may want to adjust your answer accordingly, though I think it already addresses the question as rewritten fairly well.) — Harry Johnston, Dec 03 '15 at 20:35
