I'm debugging an "Access violation" exception in a large C++ application (Visual Studio 2015). The application is built from several libraries, and the problem occurs in one of them (SystemC), although I suspect the source of the problem is elsewhere.
What I see is a function call that corrupts the address of a data member used by the caller.
    m_update_phase = true;
    m_prim_channel_registry->perform_update();
    m_update_phase = false;

    inline
    void
    sc_prim_channel_registry::perform_update()
    {
        for( int i = m_update_last; i >= 0; -- i ) {
            m_update_array[i]->perform_update();
        }
        m_update_last = -1;
    }
(These are excerpts from systemc\kernel\sc_simcontext.cpp and systemc\communication\sc_prim_channel.h, part of the SystemC library.)
The error happens after several iterations through the code above. The call to m_prim_channel_registry->perform_update() throws a "0xC0000005: Access violation writing location 0x0F4CD9E9" exception.
This happens only when building the application in Release configuration.
Looking at the assembly code, I see that the function sc_prim_channel_registry::perform_update() was inlined, and the inner function call m_update_array[i]->perform_update() seems to corrupt the stack frame of the calling function.
When m_update_last = -1; is executed, &m_update_last no longer points to a valid memory location and the exception is thrown. (m_update_last is a plain member of class sc_prim_channel_registry with type int.)
    m_update_phase = true;
    m_prim_channel_registry->perform_update();
    1034D99E mov eax,dword ptr [esi+10h]
    1034D9A1 mov byte ptr [esi+0A3h],1
    1034D9A8 mov dword ptr [ebp-18h],eax
    1034D9AB mov ebx,dword ptr [eax+28h]
    1034D9AE test ebx,ebx
    1034D9B0 js $LN163+0FEh (1034D9D0h)
    1034D9B2 mov esi,eax
    1034D9B4 mov eax,dword ptr [esi+20h]
    1034D9B7 mov edi,dword ptr [eax+ebx*4]
    1034D9BA mov ecx,edi
    1034D9BC mov eax,dword ptr [edi]
    1034D9BE call dword ptr [eax+14h]
    1034D9C1 sub ebx,1
    1034D9C4 mov byte ptr [edi+1Ch],0
    1034D9C8 jns $LN163+0E2h (1034D9B4h)
    1034D9CA mov esi,dword ptr [this]
    1034D9CD mov eax,dword ptr [ebp-18h]
    1034D9D0 mov dword ptr [eax+28h],0FFFFFFFFh
    m_update_phase = false;
The exception is thrown at address 1034D9D0, so the last instructions executed are:
    0F97D9CD mov eax,dword ptr [ebp-18h]
    0F97D9D0 mov dword ptr [eax+28h],0FFFFFFFFh
The address of m_prim_channel_registry is held in [ebp-18h] and eax, and [eax+28h] is m_update_last.
Looking at esp and ebp in the watch window just before the inner call to perform_update(), I see that:
ebp-18h 0x0022fd5c unsigned int
esp 0x0022fd60 unsigned int
This is strange. The difference between them is only 4 bytes and the next push to the stack will make them equal and overwrite [ebp-18h]!
[ebp-18h] holds a copy of this->m_prim_channel_registry. The call at 1034D9BE (call dword ptr [eax+14h]) pushes its return address onto the stack and so corrupts the contents of [ebp-18h]. It looks as if the stack pointer has drifted into the current frame, so the push overwrites one of the frame's own locals.
My questions are:
- Am I analyzing the issue correctly? Did I miss something here?
- What could cause such a corruption? I assume the issue is not related to either the compiler or the SystemC library, probably something that happened earlier someplace else.
- What are the techniques for debugging such a corruption?
Update
I believe I found the problem, but I can't say I understand this completely.
In the same function (sc_simcontext::crunch) where the outer perform_update() is invoked, SystemC methods are also invoked:
    // execute method processes
    sc_method_handle method_h = pop_runnable_method();
    while( method_h != 0 ) {
        try {
            method_h->execute();
        }
        catch( const sc_exception& ex ) {
            cout << "\n" << ex.what() << endl;
            m_error = true;
            return;
        }
        method_h = pop_runnable_method();
    }
These methods are deferred function calls registered earlier. One of them was returning with ret 4 although the caller had pushed no argument for it, so every call left esp 4 bytes higher than before. The stack frame shrank a little on each call, until the corruption described above happened.
And how did I manage to register a corrupted SystemC method? Apparently it's a bad idea to use SC_METHOD(f) when f is a virtual function of the module. Doing that caused a different, unrelated "random" function to be called.
I'm not exactly sure why it happens this way or why this limitation exists. I also don't remember seeing any warning about using virtual member functions as SystemC methods, but it was definitely the problem. When debugging the method registration inside the SC_METHOD call itself, I could see that the stored function pointer pointed to a different function than the one given to the SC_METHOD macro.
To fix the problem I called SC_METHOD(wrapper_f), where wrapper_f is a simple non-virtual member function of the module that does nothing but call f, the original virtual function. That's it.