What's the best way of finding a heap corruption that only occurs under a performance test?

Question

The software I work on (written in C++) has a heap corruption problem at the moment. Our perf test team keeps getting WER faults when the number of users logged on to the box reaches a certain threshold, but the dumps they've given me just show corruption in innocent areas (for example, when std::string frees its underlying memory).

I've tried using Appverifier and this did throw up a number of issues, which I've now fixed. However, I'm now in a situation where the testers can load up the machine as much as possible with Appverifier enabled and have a clean run, but still get heap corruption when running without it (I guess because they can get more users on without it). This means I've been unable to get a dump which actually shows the problem.

Does anyone have any other ideas for useful techniques or technologies I can use? I've done as much analysis as I can on the heap corruption dumps taken without Appverifier, but I can't see any common themes. No threads are doing anything interesting at the time of the crash, and the thread which crashes is innocent, which makes me think the corruption occurred some time before.

By any chance, is your code portable to *nix? If so, fire up `valgrind` (or find a similar tool on Windows): usually the first complaint about an "invalid read" or "invalid write" is a good hint as to where the real error is. — ereOn, Aug 04 '11 at 12:35
Ah, if only it was :-) I've used valgrind before and it's an excellent tool. Appverifier is usually pretty handy too but in this case it's not working for me :-( — Benj, Aug 04 '11 at 12:42
On another (somewhat similar) question, I recommended electric fence ported to windows. It will segfault your program intentionally on numerous memory errors, but I am uncertain if it will help in the exact problem you are facing. http://code.google.com/p/electric-fence-win32/ — San Jacinto, Aug 04 '11 at 13:00
!analyze shows the thread which faulted with an access violation, but it's clear it's heap corruption because the code which the thread is currently running couldn't have caused the circumstances which the dump shows. Sadly, !analyze can't really help you with heap corruption. — Benj, Aug 04 '11 at 16:42
!analyze can help; it depends. Could you paste the output, if that is possible? — Sriram Subramanian, Aug 04 '11 at 17:35
In this instance !analyze is of very little use. Apart from anything else, the dump is 64-bit, so you can't even rely on the parameters shown on the stack by "kv" being right. You have to use techniques like this: http://analyze-v.com/?p=482 — Benj, Aug 05 '11 at 10:10

score 6 · Accepted Answer · answered Aug 04 '11 at 12:47

The best tool is Appverifier in combination with gFlags but there are many other solutions that may help.

For example, you could specify a heap check every 16 malloc, realloc, free, and _msize operations with the following code:

#include <crtdbg.h>

int main()
{
    // Get the current debug-heap flags
    int tmp = _CrtSetDbgFlag(_CRTDBG_REPORT_FLAG);

    // Clear the upper 16 bits and OR in the desired frequency
    tmp = (tmp & 0x0000FFFF) | _CRTDBG_CHECK_EVERY_16_DF;

    // Set the new flags (only has an effect in debug builds, where _DEBUG is defined)
    _CrtSetDbgFlag(tmp);
}
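
To illustrate what a periodic heap check is looking for, here is a minimal, portable sketch of the guard-byte idea: each allocation is padded with a known pattern on both sides, and a sweep over all live blocks reports any whose padding has been overwritten. This is not the CRT's implementation. The `guardheap` namespace and every function in it are hypothetical names, and the 0xFD fill and 16-byte guard size are arbitrary choices made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <mutex>
#include <new>
#include <unordered_map>

namespace guardheap {  // hypothetical names, for illustration only

constexpr unsigned char kGuardByte = 0xFD;  // arbitrary fill pattern
constexpr std::size_t kGuardSize = 16;      // padding bytes on each side

inline std::mutex& registry_mutex() { static std::mutex m; return m; }

// Maps each user pointer to the size the caller asked for.
inline std::unordered_map<const unsigned char*, std::size_t>& live_blocks() {
    static std::unordered_map<const unsigned char*, std::size_t> blocks;
    return blocks;
}

// Allocate size bytes with a guard band before and after the user area.
inline void* guarded_alloc(std::size_t size) {
    auto* raw = static_cast<unsigned char*>(std::malloc(size + 2 * kGuardSize));
    if (!raw) throw std::bad_alloc();
    std::memset(raw, kGuardByte, kGuardSize);
    std::memset(raw + kGuardSize + size, kGuardByte, kGuardSize);
    unsigned char* user = raw + kGuardSize;
    std::lock_guard<std::mutex> lock(registry_mutex());
    live_blocks()[user] = size;
    return user;
}

// True if both guard bands still hold the fill pattern.
inline bool guards_intact(const unsigned char* user, std::size_t size) {
    const unsigned char* front = user - kGuardSize;
    const unsigned char* back = user + size;
    for (std::size_t i = 0; i < kGuardSize; ++i) {
        if (front[i] != kGuardByte || back[i] != kGuardByte) return false;
    }
    return true;
}

// Sweep every live block; returns how many have damaged guard bands.
inline std::size_t count_corrupt_blocks() {
    std::lock_guard<std::mutex> lock(registry_mutex());
    std::size_t corrupt = 0;
    for (const auto& entry : live_blocks()) {
        if (!guards_intact(entry.first, entry.second)) ++corrupt;
    }
    return corrupt;
}

// Free a block; returns false if its guards were damaged or it is unknown.
inline bool guarded_free(void* p) {
    if (!p) return true;
    auto* user = static_cast<unsigned char*>(p);
    std::size_t size = 0;
    {
        std::lock_guard<std::mutex> lock(registry_mutex());
        auto it = live_blocks().find(user);
        assert(it != live_blocks().end() && "pointer not from guarded_alloc");
        if (it == live_blocks().end()) return false;
        size = it->second;
        live_blocks().erase(it);
    }
    const bool intact = guards_intact(user, size);
    std::free(user - kGuardSize);
    return intact;
}

}  // namespace guardheap
```

Calling `count_corrupt_blocks()` at regular intervals narrows the window in which an overrun happened, which is the same trade-off the check-every-16 flag makes: more frequent sweeps cost more time but point closer to the faulty write.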

cprogrammer


Hmm, interesting. Haven't seen this approach before. – Benj Aug 04 '11 at 12:56
I'm guessing this only works with CRT memory functions like malloc/free? I'd hazard a guess that anything which uses VirtualAlloc or possibly even HeapAlloc wouldn't be checked. – Benj Aug 04 '11 at 16:47
You are right. The CRT heap is built on top of the Win32 heap and isn't aware of "native" calls like HeapAlloc or VirtualAlloc. – cprogrammer Aug 04 '11 at 20:01
You'll be pleased to know I found the issue in the end. I didn't actually use this technique, although I'm glad to know about it. I found the issue with Appverifier after screwing with the configuration of my product until it was so extreme that the problem showed up under less load. I'm going to mark this as the accepted answer since it's the only interesting alternative to Appverifier. – Benj Aug 05 '11 at 10:03

score 3 · Answer 2 · answered Aug 04 '11 at 12:42

You have my sympathies: a very difficult problem to track down.

As you say, the corruption normally happens some time before the crash, generally as the result of a misbehaving write (e.g. writing to deleted memory, running off the end of an array, or exceeding the allocated memory in a memcpy).
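
For the "writing to deleted memory" case, one cheap trick is to delay the real free: fill the block with a known pattern, park it in a quarantine list, and check later whether the pattern is still there. The following is only a sketch under assumptions, not something from a real library. The `quarantine` namespace, its function names, the 0xDD fill value and the limit of 64 held blocks are all made up for illustration, and the caller is assumed to know the block size.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <deque>
#include <mutex>

namespace quarantine {  // hypothetical names, for illustration only

constexpr unsigned char kFreedFill = 0xDD;    // arbitrary "dead memory" pattern
constexpr std::size_t kMaxQuarantined = 64;   // arbitrary quarantine depth

struct Block {
    unsigned char* ptr;
    std::size_t size;
};

inline std::mutex& list_mutex() { static std::mutex m; return m; }
inline std::deque<Block>& held() { static std::deque<Block> q; return q; }

// Stale writes found when blocks were finally released.
inline std::size_t& evicted_violations() { static std::size_t n = 0; return n; }

inline bool still_poisoned(const Block& b) {
    for (std::size_t i = 0; i < b.size; ++i) {
        if (b.ptr[i] != kFreedFill) return false;
    }
    return true;
}

// Use in place of free(): poison the block and hold on to it for a while.
inline void delayed_free(void* p, std::size_t size) {
    if (!p) return;
    assert(size > 0 && "caller must pass the real allocation size");
    auto* bytes = static_cast<unsigned char*>(p);
    std::memset(bytes, kFreedFill, size);
    std::lock_guard<std::mutex> lock(list_mutex());
    held().push_back(Block{bytes, size});
    while (held().size() > kMaxQuarantined) {
        Block oldest = held().front();
        held().pop_front();
        if (!still_poisoned(oldest)) ++evicted_violations();
        std::free(oldest.ptr);  // the real free happens only here
    }
}

// Number of blocks written to after being "freed", including evicted ones.
inline std::size_t count_written_after_free() {
    std::lock_guard<std::mutex> lock(list_mutex());
    std::size_t bad = evicted_violations();
    for (const Block& b : held()) {
        if (!still_poisoned(b)) ++bad;
    }
    return bad;
}

}  // namespace quarantine
```

Because quarantined memory is not handed back to the allocator straight away, a stale write lands in a block nobody else owns yet, so it shows up as a broken fill pattern instead of silently damaging an unrelated object. It cannot say who did the write, only that one happened and which block it hit.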

In the past (on Linux; I gather you're on Windows) I've used heap-checking tools (valgrind, Purify, Intel Inspector), but as you've observed these often affect the performance and thus obscure the bug. (You don't say whether it's a multi-threaded app, or whether it processes a variable dataset such as incoming messages.)

I have also overloaded the new and delete operators to detect double deletes, but this is quite a specific situation.
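
A minimal sketch of that kind of overload is below, assuming class-level `operator new`/`operator delete` on a base class so that the registry's own allocations do not recurse into the tracked path. The `heapdbg` namespace, `Tracked` and the helper names are hypothetical and are not the code I actually used.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <mutex>
#include <new>
#include <unordered_set>

namespace heapdbg {  // hypothetical names, for illustration only

inline std::mutex& registry_mutex() { static std::mutex m; return m; }

// Every block that has been allocated and not yet freed.
inline std::unordered_set<void*>& live_blocks() {
    static std::unordered_set<void*> blocks;
    return blocks;
}

inline void* tracked_alloc(std::size_t size) {
    void* p = std::malloc(size ? size : 1);
    if (!p) throw std::bad_alloc();
    std::lock_guard<std::mutex> lock(registry_mutex());
    const bool inserted = live_blocks().insert(p).second;
    assert(inserted && "allocator returned a block already marked live");
    (void)inserted;
    return p;
}

// Returns true if p was live and is now freed; false on a double
// (or foreign) free, in which case the memory is left untouched.
inline bool tracked_free(void* p) {
    if (!p) return true;
    {
        std::lock_guard<std::mutex> lock(registry_mutex());
        if (live_blocks().erase(p) == 0) return false;
    }
    std::free(p);
    return true;
}

// Derive from this to route a class's new/delete through the tracker.
struct Tracked {
    static void* operator new(std::size_t size) { return tracked_alloc(size); }
    static void operator delete(void* p) noexcept {
        // Abort at the faulty delete so the dump points at the culprit.
        if (!tracked_free(p)) std::abort();
    }
};

}  // namespace heapdbg
```

Aborting inside `operator delete` means the crash dump shows the stack of the second delete rather than some innocent allocation later on. Note that with a real double `delete` the destructor has already run a second time before this check fires, so the sketch catches the deallocation, not the destructor call.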

If none of the available tools help, then you're on your own and it's going to be a long debugging process. The best advice I can offer is to work on reducing the test scenario that reproduces it. Then attempt to reduce the amount of code being exercised, i.e. by stubbing out parts of the functionality. Eventually you'll zero in on the problem, but I've seen very good guys spend 6 weeks or more tracking these down on a large application (~1.5 million LOC).

All the best.

The app is indeed multithreaded, and boy does it have a lot of threads. It's essentially an agent process that gets launched for every session on a terminal services box, so in theory more users doesn't mean more load, just fewer available resources. It's clearly some kind of concurrency issue that's causing the problem; probably an object lifetime is incorrect somewhere and memory is being written to that's already been freed, or some such. Ho hum... — Benj, Aug 04 '11 at 16:18

score 0 · Answer 3 · answered Aug 04 '11 at 13:14

You should elaborate further on what your software actually does. Is it multi-threaded? When you talk about the "number of users logged on to the box", does each user open a different instance of your software in a different session? Is your software a web service? Do instances talk to each other (for example, over named pipes)?

If your error ONLY occurs at high load and does NOT occur when AppVerifier is running, the only two possibilities I can think of (without more information) are a concurrency issue in how you've implemented multi-threading, OR a hardware issue on the test machine that only manifests under heavy load (have your testers used more than one machine?).
