Debugging Stack Corruption issue

Question

I'm debugging an "Access violation" exception on a large application in C++ (Visual Studio 2015). The application is built from several libraries and the problem occurs on one of them (SystemC), although I suspect the source of the problem is elsewhere.

What I see is a function-call that corrupts the address of a member function of the caller.

m_update_phase = true;
m_prim_channel_registry->perform_update();
m_update_phase = false;

inline
void
sc_prim_channel_registry::perform_update()
{
    for( int i = m_update_last; i >= 0; -- i ) {
        m_update_array[i]->perform_update();
    }
    m_update_last = -1;
}

(These are excerpts from systemc\kernel\sc_simcontext.cpp and systemc\communication\sc_prim_channel.h, part of the SystemC library)

The error occurs after several iterations through the code above. The call to m_prim_channel_registry->perform_update() raises the exception 0xC0000005: Access violation writing location 0x0F4CD9E9.
This happens only when building the application in Release configuration.

Looking at the assembly code, I see that the function sc_prim_channel_registry::perform_update() was inlined, and the inner function call m_update_array[i]->perform_update() seems to corrupt the stack frame of the calling function.
When m_update_last = -1; is executed, &m_update_last no longer points to a valid memory location and the exception is thrown.
(m_update_last is a simple native member of class sc_prim_channel_registry with type int)

    m_update_phase = true;
    m_prim_channel_registry->perform_update();
1034D99E  mov         eax,dword ptr [esi+10h]  
1034D9A1  mov         byte ptr [esi+0A3h],1  
1034D9A8  mov         dword ptr [ebp-18h],eax  
1034D9AB  mov         ebx,dword ptr [eax+28h]  
1034D9AE  test        ebx,ebx  
1034D9B0  js          $LN163+0FEh (1034D9D0h)  
1034D9B2  mov         esi,eax  
1034D9B4  mov         eax,dword ptr [esi+20h]  
1034D9B7  mov         edi,dword ptr [eax+ebx*4]  
1034D9BA  mov         ecx,edi  
1034D9BC  mov         eax,dword ptr [edi]  
1034D9BE  call        dword ptr [eax+14h]  
1034D9C1  sub         ebx,1  
1034D9C4  mov         byte ptr [edi+1Ch],0  
1034D9C8  jns         $LN163+0E2h (1034D9B4h)  
1034D9CA  mov         esi,dword ptr [this]  
1034D9CD  mov         eax,dword ptr [ebp-18h]  
1034D9D0  mov         dword ptr [eax+28h],0FFFFFFFFh  
    m_update_phase = false;

The exception is thrown at address 1034D9D0, so the last instructions executed are:

0F97D9CD  mov         eax,dword ptr [ebp-18h]  
0F97D9D0  mov         dword ptr [eax+28h],0FFFFFFFFh

m_prim_channel_registry address is in [ebp-18h] and eax, and [eax+28h] is m_update_last.

Looking in the watch window at esp and ebp before the inner call perform_update(), I see that:

    ebp-18h 0x0022fd5c  unsigned int
    esp 0x0022fd60  unsigned int

This is strange. The difference between them is only 4 bytes and the next push to the stack will make them equal and overwrite [ebp-18h]!
[ebp-18h] holds a copy of this->m_prim_channel_registry. The call at 1034D9BE (call dword ptr [eax+14h]) pushes its return address onto the stack and thereby corrupts the contents of [ebp-18h]. It looks like the stack has grown (downwards) too far and is overwriting the caller's frame.

My questions are:

Am I analyzing the issue correctly? Did I miss something here?
What could cause such a corruption? I assume the issue is not related to either the compiler or the SystemC library, probably something that happened earlier someplace else.
What are the techniques for debugging such a corruption?

Update

I believe I found the problem, but I can't say I understand this completely.
In the same function (sc_simcontext::crunch) where the external perform_update() is invoked, systemc methods are invoked:

    // execute method processes

    sc_method_handle method_h = pop_runnable_method();
    while( method_h != 0 ) {
        try {
            method_h->execute();
        }
        catch( const sc_exception& ex ) {
            cout << "\n" << ex.what() << endl;
            m_error = true;
            return;
        }
        method_h = pop_runnable_method();
    }

These methods are deferred function calls registered earlier.
One of these methods was returning with ret 4, which removes 4 extra bytes from the stack on return. This shrank the caller's stack frame every time the method was called, to the point where the corruption described above happened.

And how did I manage to register a corrupted SystemC method?
Apparently it's a bad idea to use SC_METHOD(f) when f is a virtual function of the module. Doing that caused a different, unrelated "random" function to be called.
I'm not exactly sure why it happens this way and why this limitation exists. Also I don't remember seeing any warning about using virtual member functions as systemc methods, however it was definitely the problem. When debugging the method registration in the SC_METHOD call itself I could see the function pointer inside pointing to a different function than was given to the SC_METHOD macro.

To fix the problem I called SC_METHOD(wrapper_f), where wrapper_f is a simple non-virtual member function of the module that does nothing but call f, the original virtual function. That's it.

With this kind of code, sometimes there are problems with the _reentry_. That is, the virtual **perform_update()** sometimes modifies (adds or removes a value) the **m_update_array** while the previous stack frame is looping through it, and you have undefined results. You can debug this easily with a log file. — rodrigo, Jan 23 '17 at 13:12

score 3 · Answer 1 · 2017-07-21T15:13:12.787

You are probably having issues with member function pointers on MSVC.

Consider the following code, file main.cpp:

#include <cstdio>

struct base;
typedef void (base::*baseptr_t)();

struct base {
    void foo() { }
};

void callfoo(base *obj, baseptr_t ptr);

int main()
{
    base obj;
    std::printf("sizeof(baseptr_t)=%zu\n", sizeof(baseptr_t));
    callfoo(&obj, &base::foo);
}

and file callfoo.cpp:

#include <cstdio>

struct base;
typedef void (base::*baseptr_t)();

void callfoo(base *obj, baseptr_t ptr)
{
    std::printf("sizeof(baseptr_t)=%zu\n", sizeof(baseptr_t));
    (obj->*ptr)();
}

On x86_64 this prints:

sizeof(baseptr_t)=8
sizeof(baseptr_t)=24

before crashing with an access violation.

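This is because MSVC generates 8-byte member function pointers for a class whose definition is known, but has to generate 24-byte pointers if the class definition is not available. The two translation units therefore disagree about the layout of baseptr_t.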

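The compiler has ways to control this behavior: the /vmg and /vmb command-line options, the pointers_to_members pragma, and the __single_inheritance, __multiple_inheritance and __virtual_inheritance keywords.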

PS: I wasn't able to reproduce this, but you can also check the sc_process.h header from SystemC, which has the following lines:

#if defined(_MSC_VER)
#if ( _MSC_VER > 1200 )
#   define SC_USE_MEMBER_FUNC_PTR
#endif
#else
#   define SC_USE_MEMBER_FUNC_PTR
#endif

You can try to undefine this macro for your build; in that case SystemC will use a different strategy when calling the process function.

PS2: A member function pointer can be 8, 16 or 24 bytes in size depending on the class hierarchy, so there have to be 3 ways to dereference a member function pointer, and each of them has to handle both virtual and non-virtual calls.

That's interesting, but I'm not sure it's related to my specific issue. The problem I was seeing disappeared once the function was no longer virtual, not when the class lacked a definition. Any idea how this could be related to *virtual*? — Amir Gonnen, Jul 21 '17 at 14:38
In the 8-byte case, a pointer to a non-virtual member function is a plain pointer to the function, but if you take a pointer to a virtual function, the compiler generates a small stub function that performs the virtual call and stores the stub's address instead. With the 24-byte variant I am not sure, but it most likely stores the index of the function in the vtable, which makes it possible to call another virtual function of this class (or a virtual function of some other class) if garbage is read instead of the index. If the function is non-virtual, it might happen to work by accident because the code that performs the call is much simpler. — , Jul 21 '17 at 14:58
This is very likely the root cause of the issue. According to the SystemC `INSTALL` documentation, `/vmg` is required for MSVC. — pah, May 22 '19 at 18:22

score 0 · Answer 2 · answered Jan 23 '17 at 08:42

It seems you know what you are doing.

I can give you some advice rather than a solution. It concerns something I have encountered more than a few times that corrupts the stack.

Check the function causing the corruption, perform_update(). Does it define a big array as a local variable? If so, it probably exceeds the stack and overwrites the return data and other important data there. This is the most common cause of stack corruption I have encountered.

It is a sneaky problem because it depends on the size of the local array and the amount of stack you have. This changes from system to system.

Thanks for the advice. `perform_update()` causes the corruption before it starts executing. The call itself causes the stack corruption. `perform_update` is a function pointer which points to `sc_fifo::update()`, which is a pretty simple SystemC function with no local variables at all. I don't think the problem is there. — Amir Gonnen, Jan 23 '17 at 09:26
