Fast capture stack trace on windows / 64-bit / mixed mode

Question

Like most of you probably know there exists plenty of different mechanisms to walk stack traces, starting from windows api, and continuing further into depth of magical assembly world - let me list here some of links I've already studied.

For of all let me mention that I want to have mechanism of memory leak analysis of mixed mode (managed and unmanaged) / 64-bit + AnyCPU application and from all windows api's CaptureStackBackTrace is most suitable for my needs, but as I have analyzed - it does not support managed code stack walking. But that function API is closest to what I need (Since it also calculates back trace hash - unique identifier of particular call stack).

I've excluded different approaches of locating memory leaks - most of software which I have tried is either crashing or does not work unreliably, or gives bad results.

Also I don't want to recompile existing software and override malloc / new other mechanism - because it's heavy task (we have huge code base with a lot of dll's). Also I suspect this is not one time job which I need to perform - issue comes back with 1-2 year cycle, depending of who and what was coding, so I would prefer to have built-in memory leaks detection in application itself (memory api hooking) instead of fighting with this problem over and over again.

http://www.codeproject.com/Articles/11132/Walking-the-callstack

Uses StackWalk64 windows API function, but does not work with managed code. Also 64 bit support is not fully clear to me - I have seen some walkaround for 64 bit problem - I suspect that this code does not fully work when stack walk is done within same thread.

Then there exists process hacker: http://processhacker.sourceforge.net/

Which also use StackWalk64, but extends it's call back function (7th and 8-th parameters) to support mixed mode stack walking. After a lot of complexities with 7/8 call back functions I've managed to reach also support of StackWalk64 with mixed mode support (Catching stack trace as vector - where each pointer refers to assembly / dll location where call went by). But as you may guess - the performance of StackWalk64 is insufficient for my needs - even with simple Message box from C# side, application simply "hangs" for a while until it starts up correctly.

I haven't seen such heavy delays with CaptureStackBackTrace function call, so I assume that performance of StackWalk64 is insufficient for my needs.

There exists also COM based approach of stack trace determination - like this: http://www.codeproject.com/Articles/371137/A-Mixed-Mode-Stackwalk-with-the-IDebugClient-Inter

http://blog.steveniemitz.com/building-a-mixed-mode-stack-walker-part-1/

but what I'm afraid of - it requires COM , and thread needs to be com initialized, and due to memory api hooking I should not touch com state within any thread, because it can results in heavier problems (e.g. incorrect apartment initialization, other glitches)

Now I've reached the point when windows API becomes insufficient for my own needs, and I need to walk through call stack manually. Such examples can be found for example:

http://www.codeproject.com/Articles/11221/Easy-Detection-of-Memory-Leaks See function FillStackInfo / 32 bit only, does not support managed code.

There are couple of mentions about reversing stack trace - for example on following links:

Especially 1, 3, 4 links give some interesting night reading. :-)

But even thus they are rather interesting mechanisms, there is no fully working demo example on any of them.

I guess one of examples is Wine's dbghelp implementation (Windows "emulator" for linux), which also shows how exactly StackWalk64 works at the end, but I suspect that it's heavily bound to DWARF2 file format executable, so it's not identical to current windows PE executable file format.

Can someone pintpoint me to good implementation of stack walking, working on 64-bit architecture, with mixed mode support (can track native and managed memory allocations), which is bound purely in register / call stack / code analysis. (Combined implementations of 1, 3, 4)

Does someone has any good contact from Microsoft development team, who could potentially answer to this question ?

TarmoPikaro · Answer 1 · 2016-01-04T21:55:26.957

Just notes to myself:

Apparently CaptureStackBackTrace is probably calls directly or indirectly RtlCaptureStackBackTrace, and source code of that function is apparently is open source code currently - can be searched using "windows research kernel".

Code which I've accidentally found by harvesting https://github.com/dotnet/coreclr/blob/master/src/unwinder/amd64/unwinder_amd64.cpp

where there was reference in code that's borrowed from windows kernel:

Everything below is borrowed from minkernel\ntos\rtl\amd64\exdsptch.c file from Windows

and by googling bit more I've located windows kernel itself.

May be I can upgrade that function to support managed stack as well (using information from process hacker).

[4.1.2015] By analyzing deeper it looks like main performance bottleneck is not CaptureStackBackTrace itself - because it's simple iteration, structure lookup, but managed mode stack walk, where I call C:\Windows\Microsoft.NET\Framework64\v4.0.30319\mscordacwks.dll / OutOfProcessFunctionTableCallback - you can find it's source code in .net distribution, and apparently its allocating memory for analyzing JIT compiled structures. But problem is that JIT compilation can change whenever, and only way to have reliable stack trace is to re-query same information over and over again, which can cause overhead in memory allocation. I guess code needs to be changed so that mscordacwks similar code will not allocate memory by itself, but use run-time structures to determine call stack and function table / function entries.

P.S. if you vote down this answer, I would like to know reason why, what is the alternative. And better if you have tried alternative by yourself.

Interesting I have always wanted to have a definitive source how the unwinder works in x64. It has even the code for Unwind Info version 2 which allows to describe the epilog. — Alois Kraus, Dec 31 '15 at 10:55

score 3 · Accepted Answer · edited May 23 '17 at 12:25

9-1-2015 - I've located original function which gets called by process hacker, and that one was

C:\Windows\Microsoft.NET\Framework64\v4.0.30319\mscordacwks.dll OutOfProcessFunctionTableCallback

it's source code - which was here: https://github.com/dotnet/coreclr/blob/master/src/debug/daccess/fntableaccess.cpp

From there I have owner of most of changes in that source code - Jan Kotas (jkotas@microsoft.com) and contacted him about this problem.

From: Jan Kotas <jkotas@microsoft.com>
To: Tarmo Pikaro <tapika@yahoo.com> 
Sent: Friday, January 8, 2016 3:27 PM
Subject: RE: Fast capture stack trace on windows 64 bit / mixed mode...

...

The mscordacwks.dll is called mscordaccore.dll in CoreCLR / github repro. The VS project 
files are auto-generated for it during the build 
(\coreclr\bin\obj\Windows_NT.x64.Debug\src\dlls\mscordac\mscordaccore.vcxproj).
You should be able to build and debug CoreCLR to understand how it works.
...

From: Jan Kotas <jkotas@microsoft.com>
To: Tarmo Pikaro <tapika@yahoo.com> 
Sent: Saturday, January 9, 2016 2:02 AM
Subject: RE: Fast capture stack trace on windows 64 bit / mixed mode...

> I've tried to replace 
> C:\Windows\Microsoft.NET\Framework64\v4.0.30319\mscordacwks.dll dll loading 
> with C:\Prototyping\dotNet\coreclr-master\bin\obj\Windows_NT.x64.Debug\src\dlls\mscordac\Debug\mscordaccore.dll
> loading (just compiled), but if previously I could get mixed mode stack trace correctly:
> ...

mscordacwks.dll is tightly coupled with the runtime. You cannot mix and match them between runtimes.
What I meant is that you can use CoreCLR to understand how this works.

But then he recommended this solution which was working for me:

int CaptureStackBackTrace3(int FramesToSkip, int nFrames, PVOID* BackTrace, PDWORD pBackTraceHash)
{
    CONTEXT ContextRecord;
    RtlCaptureContext(&ContextRecord);

    UINT iFrame;
    for (iFrame = 0; iFrame < nFrames; iFrame++)
    {
        DWORD64 ImageBase;
        PRUNTIME_FUNCTION pFunctionEntry = RtlLookupFunctionEntry(ContextRecord.Rip, &ImageBase, NULL);

        if (pFunctionEntry == NULL)
            break;

        PVOID HandlerData;
        DWORD64 EstablisherFrame;
        RtlVirtualUnwind(UNW_FLAG_NHANDLER,
            ImageBase,
            ContextRecord.Rip,
            pFunctionEntry,
            &ContextRecord,
            &HandlerData,
            &EstablisherFrame,
            NULL);

        BackTrace[iFrame] = (PVOID)ContextRecord.Rip;
    }

    return iFrame;
}

This code snipet still is missing backtrace hash calculation, but this is something can can be added afterwards.

It's very import also to note that when debugging this code snipet you should use native debugging, not mixed mode (C# project by default use mixed mode), because it somehow disturbs stack trace in debugger. (Something to figure out how and why such distortion happens)

There is still one missing piece of puzzle - how to make symbol resolution fully resistant to FreeLibrary / Jit code dispose, but this is something I need to figure out still.

Please note that RtlVirtualUnwind will most probably work only on 64-bit architecture, not on arm or 32-bit.

One more funny thing is that there exists function RtlCaptureStackBackTrace which somehow resembles windows api function CaptureStackBackTrace - but they somehow differ - at least by naming. Also if you check RtlCaptureStackBackTrace - it calls eventually RtlVirtualUnwind - you can check it from Windows Research Kernel source codes

RtlCaptureStackBackTrace
>
RtlWalkFrameChain
>
RtlpWalkFrameChain
>
RtlVirtualUnwind

But what I have tested RtlCaptureStackBackTrace does not works correctly. Unlike function RtlVirtualUnwind above.

It's a kinda magic. :-)

I'll continue this questionnaire with phase 2 question - in here:

Resolve managed and native stack trace - which API to use?

There is a blog Post from Naveen which could be of interest to You http://www.nynaeve.net/?p=99 it deals with some not well documented issues of RtlVirtualUnwind You might run into. — Alois Kraus, Jan 10 '16 at 09:50
Typically developer does not know anything about stack walking - at least me. Yes, I know assembly, and more or less I can figure out which registers are pushed and why, but in order to have fully functional code you need to use already created api and not to reinvent the wheel. (Or at least reinvent so that you fully understand what you're doing) So I was sitting in google prompt and trying to figure out how to find the solution. Of course "track trace" or "stack frame", "walking" are first words which I have used for google. But it's difficult to locate "unwind" at the end. — TarmoPikaro, Jan 12 '16 at 20:53

Alois Kraus · Answer 3 · 2015-12-30T18:30:39.243

x64 Stack walking is complicated as you have already found out. A simple alternative is to simply not do it but leave the hard things to the OS ETW stackwalker. That works and it is much faster than you will ever get.

You can take advantage of it by emitting your own ETW event. Before that you need to start an ETW session for your event provider and enable stack walking for your provider. There is a catch on Windows 7 where it does not work unless the managed stack frames are all NGenned because the x64 ETW Stackwalker will stop if he finds a stack frame which is not in any loaded module which is true for JITed code.

Starting with Windows 8 the ETW Stackwalker will always walk the first MB of the stack for stack frames which fixes the JIT problem. The JIT compiler emits Unwind Infos for the generated code if ETW tracing is on and registers it via RtlAddGrowableFunctionTable which makes it possible to walk the stack fast from within the kernel in the first place. Things work differently when ETW tracing is not enabled for compatibility reasons.

If you are after malloc/free new/delete memory leaks you can also use the OS bultin capabilities of Heap allocation tracing which already exists since Windows 7. See xperf -help start and https://randomascii.wordpress.com/2015/04/27/etw-heap-tracingevery-allocation-recorded/ for more information about heap allocation tracing. You can enable it for an already running process with no problems. The downside is that for any real world application the generated data is huge. But if you are after big allocations only then it can help to track only VirtualAlloc calls which can also be enabled with minimal overhead.

Managed code since .NET 4.5 has also its own ETW allocation tracing provider with full stack walking even on x64 Windows 7 because it does a full managed stackwalk by itself. More infos can be found in the CoreClr Sources at: ETW::SamplingLog::SendStackTrace in https://github.com/dotnet/coreclr/blob/master/src/inc/eventtracebase.h for many more details.

That is only a rough outline what is possible. To really get all of the necessary details would take a whole book I fear. And I am still learning new things every day.

Here is a heapalloc.cmd script which you can use to track heap allocations. By default it records into a 500MB ring buffer if your leak builds up over longer periods of time recording all allocation stacks without condensing them at runtime will not work out with WPA. But you could post process a huge ETL file and write your own viewer for it.

@echo off 
setlocal enabledelayedexpansion
REM consider using a different drive for ETL output to prevent slowing down 
REM your application and to prevent lost buffers
set OUTDIR=C:\TEMP
set OUTFILENAME=HeapTracing.etl
REM Final output file
set OUTFILE=!OUTDIR!\!OUTFILENAME!
set CLRUNDOWNFILE=!OUTDIR!\clr_HeapDCend.etl
set KERNELFILE=!OUTDIR!\kernel.etl
set CLRSESSIONFILE=!OUTDIR!\clrHeapSession.etl
set HEAPUSERFILE=!OUTDIR!\HeapUserSession.etl
REM Default is allocation and realloc to track memory leaks
REM HeapFree is the other option to track double free calls
set HEAPTRACINGFLAGS=HeapAlloc+HeapRealloc 

if "%3" NEQ "" (
echo Overriding Heap Tracing Flags with: %3
set HEAPTRACINGFLAGS=%3
)


if "%1" EQU "-start" ( 
    call :StartTracing -PidNewProcess %2
    goto :Exit 
) 

if "%1" EQU "-attachPid" ( 
    call :StartTracing -Pids %2
    goto :Exit 
) 

if "%1" EQU "-startNext" (
    reg add "HKLM\Software\Microsoft\Windows NT\CurrentVersion\Image File Execution Options\%~nx2" /v TracingFlags /t REG_DWORD /d 1 /f
    if not %errorlevel% == 0 goto failure
    call :StartTracing -Pids 0
    goto :Exit
)

if "%1" EQU "-stop" ( 
    set XPERF_CreateNGenPdbs=1
    xperf -start ClrRundownSession -on e13c0d23-ccbc-4e12-931b-d9cc2eee27e4:0x118:5+a669021c-c450-4609-a035-5af59af4df18:0x118:5 -f "!CLRUNDOWNFILE!" -buffersize 256 -minbuffers 256 -maxbuffers 512 
    call :WaitUntilRundownCompleted "!CLRUNDOWNFILE!"
    xperf -stop -stop ClrSession ClrRundownSession HeapSession | findstr /V identifiable 2> NUL

    echo Merging profiles
    REM Reset symbol path to create the pdbs files in the output directory with in the directory with the same name like our etl file
    set TMPSYMBOLPATH=!_NT_SYMBOL_PATH!
    REM Each tool is using a different pdb cache folder. If you are using them side by side 
    REM you have to wait a long time to refresh the pdb cache. To spare the waiting time we use 
    REM the pdb cache folder from WPR

    mkdir C:\ProgramData\WindowsPerformanceRecorder\NGenPdbs_Cache 2> NUL
    set _NT_SYMBOL_PATH=srv*C:\ProgramData\WindowsPerformanceRecorder\NGenPdbs_Cache 
    mklink /D "!OUTFILE!.NGENPDB" C:\ProgramData\WindowsPerformanceRecorder\NGenPdbs_Cache  2> NUL

    echo Managed PDBs are stored at: !OUTFILE!.NGENPDB. If you want to transfer the etl do not forget to copy this directory with the pdbs as well. 
    echo Merging ETL files and generating native pdbs

    xperf -merge  "!KERNELFILE!" "!CLRSESSIONFILE!" "!CLRUNDOWNFILE!" "!HEAPUSERFILE!" "!OUTFILE!"
    set _NT_SYMBOL_PATH=!TMPSYMBOLPATH!
    echo !OUTFILE! was created

    if "%2" NEQ "" reg delete "HKLM\Software\Microsoft\Windows NT\CurrentVersion\Image File Execution Options\%~nx2" /v TracingFlags /f 2> NUL
    goto :Exit 
) 

goto Usage:

:StartTracing
xperf -start ClrSession -on  Microsoft-Windows-DotNETRuntime:5 -f "!CLRSESSIONFILE!" -buffersize 128 -minbuffers 256 -maxbuffers 512 
xperf -on PROC_THREAD+LOADER+latency+virt_alloc -stackwalk VirtualAlloc  -f "%KERNELFILE%"
xperf -start HeapSession -heap %1 %2 -BufferSize 1024 -MinBuffers 128 -MaxBuffers 1024 -stackwalk %HEAPTRACINGFLAGS% -f "!HEAPUSERFILE!" -FileMode Circular -MaxFile 1024
exit /B

REM Wait until writing to ETL file has stopped by checking its file size
:WaitUntilRundownCompleted
:StillWriting
    for %%F in (%1) do set "size=%%~zF"
    timeout /T 1  > nul
    for %%F in (%1) do set "size2=%%~zF"
    if "!size!" EQU "" goto :EndWriting
    if "!size!" NEQ "!size2!" goto StillWriting
:EndWriting
timeout /T 1  > nul
exit /B


:Usage
    echo Usage: 
    echo HeapAlloc.cmd -start [executable] or -stop
    echo               -start [executable] Start a trace session 
    echo               -startNext [executable] Start heap tracing for all subsequent calls to executable.
    echo               -attachPid ddd Start a trace session for specified process
    echo               -stop  [executable] Stop a trace session 
    echo Examples
    echo     HeapAlloc.cmd -startNext devenv.exe
    echo     HeapAlloc.cmd -stop      devenv.exe
    echo To attach to a running process
    echo     HeapAlloc.cmd -attachPid dddd
    echo     HeapAlloc.cmd -stop 
    echo You must call -stop for your executable if you have used -start or startNext because heap allocation tracing will enabled until you stop it!
goto :Exit 

:failure
    echo Error occured
goto :Exit

:Exit

I have briefly tried out ETW - looks like something I need, but I did not managed to get out call stack versus allocated size out of those graphs - does they exists ? What I'm afraid of is that this is another tool - let's check how it works, you spend 2-3 hours trying this and that without any clear result. Do I need to write some ETW event handler in order to monitor heap allocations ? — TarmoPikaro, Dec 30 '15 at 12:51
I'm currently using mechanism similar to: http://www.codeproject.com/Articles/11221/Easy-Detection-of-Memory-Leaks but I have replaced detours with minhooks (detours is commercial library) http://www.codeproject.com/Articles/44326/MinHook-The-Minimalistic-x-x-API-Hooking-Libra So even now I can collect all non-freed allocations and sort them from largest to smallest consumer (extract top 10), but only thing what I'm missing is reliable mechanism of getting call stack. — TarmoPikaro, Dec 30 '15 at 12:52
Yes ETW allows you to get the call stacks for the not yet freed objects. No you do not need to write any code. See updated answer with a sample script how it can be done. — Alois Kraus, Dec 30 '15 at 18:26
Tried bit more - even thus I can see some leaks - at least some familiar dll names, but call stack is not full - at least I would like to know to which address in dll it landed - I can even reverse engineere guilty function, but now there is not sufficient information about who leaked out ( at least starting from dll which called malloc / new / gcnew function ). So xperf is still useless to me. May be there is some command line argument missing thus ? Produced trace file however is huge - over 2 Gb with rather small 3d models which I have here. Also a lot of time takes merging information. — TarmoPikaro, Dec 30 '15 at 23:43
My own tracing application can display top 10 leakers immediately and it does not slows down application anyhow - in same manner as xperf. Also from future development perspective it makes sense to have mixed mode stack trace function to be built in application. If crash or exception occurs - it can be diagnosed without any debugging effort. But I guess my own implementation of StackWalk64 will do the trick as well for this, even thus it's bit slow. — TarmoPikaro, Dec 30 '15 at 23:48
Used command lines: heapalloc.cmd -attachPid 3132 and then: heapalloc.cmd -stop — TarmoPikaro, Dec 30 '15 at 23:52
If you are using it on Win7 x64 not with full NGen then the stack will stop at the first jited method frame. I have written this in my answer already. Is this the reason for your broken stacks or something else? — Alois Kraus, Dec 31 '15 at 10:49
Probably, I have windows 7, 64 bit / .net 4. So do I need to upgrade OS to windows 8 ? Also may be .net framework version ? — TarmoPikaro, Dec 31 '15 at 14:34
Windows >8 would definitely help a lot. Also an upgrade to the newest .NET Framework version is a good idea since the ETW support of .NET has become dramatically better since .NET 4.0. For stackwalking you need correctly generated pdbs out of your ngen images which could have issues in older .NET versions. If you like Win7 you definitely need to call c:\Windows\microsoft.net\Framework64\v4.0.30319\ngen.exe install allofyourdlls. That should fix the stack issue under Win7 if you have a recent .NET Framework (4.5 or better) installed. — Alois Kraus, Jan 02 '16 at 07:20
Upgrading OS or .net is quite normally non-trivial thing to do. It would be better to know exactly what needs to be upgraded and where is the bug. I suspect that xperf is not working correctly, and there is no clear evidence that this is not true. I have .pdb generated correctly for all dll's and exe's. Also if the bug is inside os or .net, this can be reported as bug to Microsoft and probably fixed also in older .net framework as bugfix ? — TarmoPikaro, Jan 02 '16 at 22:16
Good luck with getting a backport. What exactly can be done with which .net and Windows version is beyond SO. Consult someone who knows the answers e.g. MS support — Alois Kraus, Jan 02 '16 at 22:33
I finally made up my own memory leak detection. After starting to finilize project, I've decided to cross check 32-bit compilability - and I did not have same function as in this thread for 32-bit system - I've replaced CaptureStackBackTrace - and noticed that it does not capture full stack - same problem as with xperf. So it looks like xperf is suffering from same problem with unreliable call stack determination, as written within this question. Someone route this Q/A to xperf developers ? :-) — TarmoPikaro, Jan 25 '16 at 15:14
Xperf for 32 bit works since win 7. But you need to use WPA of Windows SDK 8.1 or greater. 8.0 had a bug in it which was fixed in 8.1 SDK. If you use Win 10 SDK you need to use Update 1 SDK because xperf forgets to collect the loaded modules during trace stop on Windows 7. It really looks like MS has no regression tests for Win 7 since a long time. — Alois Kraus, Jan 25 '16 at 19:50
So finally I've polished my code and created 32/64/mixed mode memory leak detection - I've published over here: http://www.saunalahti.fi/~tarmpika/diagnostic/ May be you can try and compare how it looks like compared to xperf ? Bug reports can be sent to me directly, I'll fix them if you find any. — TarmoPikaro, Feb 12 '16 at 16:30

score 2 · Answer 4 · answered Jan 25 '16 at 20:36

25.1.2016 Writing as separate issue, as complementary information.

For stack unique id CaptureStackBackTrace uses simple sum of all instruction pointers - idea is borrowed from: "Windows_Research_Kernel(sources)\WRK-v1.2\base\ntos\rtl\amd64\stkwalk.c":

    size_t hashValue = 0;

    for (int i = 0; i < nFrames; i++)
        hashValue += PtrToUlong(BackTrace[i]);

    *pBackTraceHash = (DWORD)hashValue;

I'm not sure about last conversion - some specify last parameter as DWORD, some as ulong64, but it's not relevant. The main problem with this calculation is that it's not unique enough. For case of recursive function calls - if you have call order:

func1
func2
func3

Stack trace for:

func1
func3
func2

Will be identical.

What I have debugged - for memory leak detection I'm getting 62876 false hits - unique stack id calculation is not reliable enough.

I've bit switched formula to:

static DWORD crc32_tab[] =
{
    0x00000000, 0x77073096, 0xee0e612c, 0x990951ba, 0x076dc419, 0x706af48f,
    0xe963a535, 0x9e6495a3, 0x0edb8832, 0x79dcb8a4, 0xe0d5e91e, 0x97d2d988,
    0x09b64c2b, 0x7eb17cbd, 0xe7b82d07, 0x90bf1d91, 0x1db71064, 0x6ab020f2,
    0xf3b97148, 0x84be41de, 0x1adad47d, 0x6ddde4eb, 0xf4d4b551, 0x83d385c7,
    0x136c9856, 0x646ba8c0, 0xfd62f97a, 0x8a65c9ec, 0x14015c4f, 0x63066cd9,
    0xfa0f3d63, 0x8d080df5, 0x3b6e20c8, 0x4c69105e, 0xd56041e4, 0xa2677172,
    0x3c03e4d1, 0x4b04d447, 0xd20d85fd, 0xa50ab56b, 0x35b5a8fa, 0x42b2986c,
    0xdbbbc9d6, 0xacbcf940, 0x32d86ce3, 0x45df5c75, 0xdcd60dcf, 0xabd13d59,
    0x26d930ac, 0x51de003a, 0xc8d75180, 0xbfd06116, 0x21b4f4b5, 0x56b3c423,
    0xcfba9599, 0xb8bda50f, 0x2802b89e, 0x5f058808, 0xc60cd9b2, 0xb10be924,
    0x2f6f7c87, 0x58684c11, 0xc1611dab, 0xb6662d3d, 0x76dc4190, 0x01db7106,
    0x98d220bc, 0xefd5102a, 0x71b18589, 0x06b6b51f, 0x9fbfe4a5, 0xe8b8d433,
    0x7807c9a2, 0x0f00f934, 0x9609a88e, 0xe10e9818, 0x7f6a0dbb, 0x086d3d2d,
    0x91646c97, 0xe6635c01, 0x6b6b51f4, 0x1c6c6162, 0x856530d8, 0xf262004e,
    0x6c0695ed, 0x1b01a57b, 0x8208f4c1, 0xf50fc457, 0x65b0d9c6, 0x12b7e950,
    0x8bbeb8ea, 0xfcb9887c, 0x62dd1ddf, 0x15da2d49, 0x8cd37cf3, 0xfbd44c65,
    0x4db26158, 0x3ab551ce, 0xa3bc0074, 0xd4bb30e2, 0x4adfa541, 0x3dd895d7,
    0xa4d1c46d, 0xd3d6f4fb, 0x4369e96a, 0x346ed9fc, 0xad678846, 0xda60b8d0,
    0x44042d73, 0x33031de5, 0xaa0a4c5f, 0xdd0d7cc9, 0x5005713c, 0x270241aa,
    0xbe0b1010, 0xc90c2086, 0x5768b525, 0x206f85b3, 0xb966d409, 0xce61e49f,
    0x5edef90e, 0x29d9c998, 0xb0d09822, 0xc7d7a8b4, 0x59b33d17, 0x2eb40d81,
    0xb7bd5c3b, 0xc0ba6cad, 0xedb88320, 0x9abfb3b6, 0x03b6e20c, 0x74b1d29a,
    0xead54739, 0x9dd277af, 0x04db2615, 0x73dc1683, 0xe3630b12, 0x94643b84,
    0x0d6d6a3e, 0x7a6a5aa8, 0xe40ecf0b, 0x9309ff9d, 0x0a00ae27, 0x7d079eb1,
    0xf00f9344, 0x8708a3d2, 0x1e01f268, 0x6906c2fe, 0xf762575d, 0x806567cb,
    0x196c3671, 0x6e6b06e7, 0xfed41b76, 0x89d32be0, 0x10da7a5a, 0x67dd4acc,
    0xf9b9df6f, 0x8ebeeff9, 0x17b7be43, 0x60b08ed5, 0xd6d6a3e8, 0xa1d1937e,
    0x38d8c2c4, 0x4fdff252, 0xd1bb67f1, 0xa6bc5767, 0x3fb506dd, 0x48b2364b,
    0xd80d2bda, 0xaf0a1b4c, 0x36034af6, 0x41047a60, 0xdf60efc3, 0xa867df55,
    0x316e8eef, 0x4669be79, 0xcb61b38c, 0xbc66831a, 0x256fd2a0, 0x5268e236,
    0xcc0c7795, 0xbb0b4703, 0x220216b9, 0x5505262f, 0xc5ba3bbe, 0xb2bd0b28,
    0x2bb45a92, 0x5cb36a04, 0xc2d7ffa7, 0xb5d0cf31, 0x2cd99e8b, 0x5bdeae1d,
    0x9b64c2b0, 0xec63f226, 0x756aa39c, 0x026d930a, 0x9c0906a9, 0xeb0e363f,
    0x72076785, 0x05005713, 0x95bf4a82, 0xe2b87a14, 0x7bb12bae, 0x0cb61b38,
    0x92d28e9b, 0xe5d5be0d, 0x7cdcefb7, 0x0bdbdf21, 0x86d3d2d4, 0xf1d4e242,
    0x68ddb3f8, 0x1fda836e, 0x81be16cd, 0xf6b9265b, 0x6fb077e1, 0x18b74777,
    0x88085ae6, 0xff0f6a70, 0x66063bca, 0x11010b5c, 0x8f659eff, 0xf862ae69,
    0x616bffd3, 0x166ccf45, 0xa00ae278, 0xd70dd2ee, 0x4e048354, 0x3903b3c2,
    0xa7672661, 0xd06016f7, 0x4969474d, 0x3e6e77db, 0xaed16a4a, 0xd9d65adc,
    0x40df0b66, 0x37d83bf0, 0xa9bcae53, 0xdebb9ec5, 0x47b2cf7f, 0x30b5ffe9,
    0xbdbdf21c, 0xcabac28a, 0x53b39330, 0x24b4a3a6, 0xbad03605, 0xcdd70693,
    0x54de5729, 0x23d967bf, 0xb3667a2e, 0xc4614ab8, 0x5d681b02, 0x2a6f2b94,
    0xb40bbe37, 0xc30c8ea1, 0x5a05df1b, 0x2d02ef8d
};

if (pBackTraceHash)
{

    size_t hashValue = 0;
    for( int idxFrame = 0; idxFrame < (int)iFrame; idxFrame++ )
    {
        unsigned char* p = (unsigned char*)&BackTrace[idxFrame];
        for( int i = 0; i < sizeof(void*); i++ )
            hashValue = crc32_tab[ ((hashValue ^ *p++) & 0xFF) ] ^ (hashValue >> 8);
    }
    *pBackTraceHash = (DWORD)hashValue;
}

This algorithm does not give false hits, but slows down execution a little bit.

Also memory leak statistics differs: Unreliable algorithm: Total amount of leaked memory: 48'874'764 / in 371 allocation pools Crc32 based algorithm: Total amount of leaked memory: 48'874'764 / in 614 allocation pools

Like you see - statistics combines (pools) similar call stack together - less fragmentation, but original call stack is lost. (Incorrect statistics)

May be someone could give me some faster algorithm for this ?

score 1 · Answer 5 · answered Jan 17 '16 at 10:42

1

Btw - if anyone is missing original implementation of StackWalk for windows, it's located over here:

https://github.com/dotnet/coreclr/blob/master/src/utilcode/stacktrace.cpp

answered Jan 17 '16 at 10:42

TarmoPikaro

4,723
2
50
62

score 1 · Answer 6 · answered Jan 26 '16 at 22:44

27.1.2016 And may be out of direct question - is 32-bit call stack determination. I've asked about which API to use - at least CaptureStackBackTrace produces incomplete traversals (Only native code), and also RtlVirtualUnwind api function does not exists for 32-bit windows.

From: Noah Falk <noahfalk@microsoft.com>
To: Tarmo Pikaro <tapika@yahoo.com>; Mike McLaughlin <mikem@microsoft.com> 
Cc: Jan Kotas <jkotas@microsoft.com>
Sent: Tuesday, January 26, 2016 1:34 AM
Subject: RE: Resolving managed call stack from void*

Hi Tarmo, hope the exploration of stackwalking has been interesting. 
If I followed you correctly you’ve been successful on x64 but hoping you can extend your technique to 32 bit. 
Indeed the RtlCaptureVirtualUnwind techniques don’t work here, and the fundamental reason behind it is that 
while x64 defines a specific calling convention that all code on Windows is forced to use, x86 does not. 
This means that there is no algorithm the OS could implement which guarantees correct unwinding when PDBs are 
unavailable. However you do have some options:

1)      You can use simple heuristics that work for certain kinds of code. 
Unoptimzed code on x86 often uses EBP chaining, in which ESP in the current frame points to EBP, and EBP points 
to the parent frame’s EBP, and so on down the stack. The return address is stored on the stack adjacent to EBP. 
As I recall all jitted code produced by recent versions of .Net follows these conventions, including optimized 
jitted code. However when a compiler performs inlining these conventions will be unable to detect it, and optimized 
code that does not follow this convention could easily cause the stack to become unwalkable.

2)      If you are willing to load PDBs you can use the DIA APIs to walk the stack: 
https://msdn.microsoft.com/en-us/library/dt06fh94.aspx. The PDB contains additional data about optimized code 
which allows frames that do not follow the EBP chaining convention to be correctly unwound. 
This is the stack walk API that Visual Studio is using when it debugs 32 bit native code on Windows.

3)      The ICorDebug APIs (https://msdn.microsoft.com/en-us/library/dd646502(v=vs.110).aspx) are a set of 
APIs that are designed to support managed code debuggers. Starting in .Net 4.0 the ICorDebug API supports 
dump debugging, however the API is designed in such a way that you don’t have to serialize a dump file. 
This is likely to be more complicated than you would want, but its supported to the use the Windows process 
snapshot APIs to take a snapshot of the memory space and then direct the ICorDebug API to read from this 
snapshot as if it was a dump. One advantage of the ICorDebug API is that not only will it give you managed 
stack frames, it also allows exporing all the other kinds of data debuggers would expose such as parameters, 
local values, fields of objects, types of the values, etc.

The MDbg tool (https://www.microsoft.com/en-us/download/details.aspx?id=2282) is a complete sample debugger 
with source included. It supports dump debugging and displaying callstacks, though it won’t have any specific 
example about using the process snapshot APIs in place of using a dump. The main change would be replacing 
the implementation of ICorDebugDataTarget. MDbg has an implementation that reads from a dump file and you 
would need to create a new implementation that reads from a process snapshot using the windows APIs 
(https://msdn.microsoft.com/en-us/library/dn457825(v=vs.85).aspx). I’ve never written the code myself and 
I’ve heard from other tool authors that they found using the windows snapshot APIs more difficult than expected,
 but eventually they were successful.

And I was bit inspired by approach 1, since already saw similar approach being done in another project, so I've wrote my own implementation for 32-bit stack traversal:

int CaptureStackBackTracePro( int FramesToSkip, int nFrames, PVOID* BackTrace, PDWORD pBackTraceHash )
{
    //
    //  This approach was taken from StackInfoManager.cpp / FillStackInfo
    //  http://www.codeproject.com/Articles/11221/Easy-Detection-of-Memory-Leaks
    //  - slightly simplified the function itself.
    //
    int regEBP;
    __asm mov regEBP, ebp;

    long *pFrame = (long*) regEBP;              // pointer to current function frame
    void* pNextInstruction;
    int iFrame = 0;

    //
    // Using __try/_catch is faster than using ReadProcessMemory or VirtualProtect.
    // We return whatever frames we have collected so far after exception was encountered.
    //
    __try {
        for( ; iFrame < nFrames; iFrame++ )
        {
            pNextInstruction = (void*)(*(pFrame + 1));

            if( !pNextInstruction )     // Last frame
                break;

            BackTrace[iFrame] = pNextInstruction;
            pFrame = (long*)(*pFrame);
        }
    }
    __except(EXCEPTION_EXECUTE_HANDLER) 
    {
    }

    // pBackTraceHash fillout is missing, see in another answer code snipet.

    return iFrame;

} //CaptureStackBackTracePro

Brief tests indicate that this function is able to capture native and managed stack frames.

Optimized code I guess requires bit deeper analysis. Better to leave out optimizations or only optimize relevant parts of code - for better diagnostics ?!

Fast capture stack trace on windows / 64-bit / mixed mode

6 Answers6

Linked