I'm having one of those moments where I'm sure there is some obvious thing I'm missing but I can't see it for looking.

We have some code (Not Invented Here, natch) which looks something like this (I've made it pseudocode for ease of reading):

struct outputs_struct{
    char *SomeString;
};

int DoSomething(struct allthings_struct *AllThings)
{
    struct inputs_struct The_Inputs;
    struct outputs_struct The_Outputs;
    int error = 0;

    // Populate input data, then:

    error = DoGetOutputsFromInputs(AllThings, &The_Inputs, &The_Outputs);

    return error;
}


int DoGetOutputsFromInputs(struct allthings_struct *AllThings, struct inputs_struct *Inputs, struct outputs_struct *Outputs)
{
    // Some reading of input data, then:

    Outputs->SomeString = (char *)malloc(100);
    strcpy(Outputs->SomeString, "Hello,world");

    // Some other stuff

    return 0;
}

As soon as this function returns, we get a SEGFAULT.

It SEGFAULTs immediately on coming back from DoGetOutputsFromInputs(). Likewise if I print markers & pause before the return statement in DoGetOutputsFromInputs() it is fine right up to the moment it actually returns.

I have also tried upping my caffeine dosage, experiments are ongoing in that department, so far: no progress.

Edit 1: Further testing shows the malloc() is not at fault: the code still crashes if we return before reaching that part. So there is some oddness going on elsewhere that I will have to chase down.

Apologies for the vagueness and pseudocode. It's a huge steaming pile of code auto-generated by gSoap (which doesn't auto-generate any sort of comments or documentation, of course...) from ONVIF WSDLs. We're developing on Ubuntu and the target is a TI DaVinci DSP/ARM9 SoC. This code is a subsection of a corner of the TI SDK, so various things are outside our immediate influence or too time-consuming to delve into.

John U
  • The_Outputs is of type `outputs_struct` but the parameter is of type `some_other_things_struct`. How do these two types relate? – Remus Rusanu Aug 26 '14 at 11:43
  • [Don't cast the result of `malloc`](http://stackoverflow.com/questions/605845/do-i-cast-the-result-of-malloc/605858#605858). – Quentin Aug 26 '14 at 11:43
  • @RemusRusanu - Oops, that was a typo by me, the types are the same, I have edited to correct. – John U Aug 26 '14 at 11:51
  • Use valgrind to find your bugs. – John Zwinck Aug 26 '14 at 11:53
  • It looks like you are hitting a stack overwrite, but the code as shown here looks (is) pretty innocent, not counting the obvious memory leak. Maybe you can recreate the error in a tiny, compilable and runnable app? – Ferenc Deak Aug 26 '14 at 11:55
  • @JohnZwinck - Unfortunately the combination of environments, platforms, hardware constraints and time means valgrind isn't a viable option at this exact moment. – John U Aug 26 '14 at 11:56
  • That's the thing with memory corruption bugs - the code where the error shows up may well be utterly unrelated to the code where the error actually originates. Do you have the facility in your platform to turn on any extra memory debugging, eg asserting on a double free or overwriting beyond the ends of the allocated space? Because 10 to 1 says you've got either a double free or a buffer overrun happening somewhere shortly before you run this part of the code. – Vicky Aug 26 '14 at 11:58
  • @JohnU: OK, what platform are you using? – John Zwinck Aug 26 '14 at 11:58
  • Your pseudocode doesn't show the error; try to produce an MCVE. Even if you don't know where the error is, you can do this by gradually removing sections of code until the error goes away, and so gradually home in on the problem. – M.M Aug 26 '14 at 12:02
  • All - I have updated the question to reflect a few details. @MattMcNabb - I googled "MCVE" and got _"Milk for Cheese Value Equivalent"_, I assume it's _not_ that? – John U Aug 26 '14 at 12:14
  • @JohnU [see here for MCVE](http://stackoverflow.com/help/mcve) – M.M Aug 26 '14 at 12:15
  • @MattMcNabb - That makes more sense than _Mouse Cerebral Vascular Endothelial_. I am now working down that route. – John U Aug 26 '14 at 12:20

3 Answers

Your example does not repro. I suspect that passing a reference to The_Outputs, which is declared on the parent's stack frame, is the culprit: somewhere in the code a cast fools the compiler into writing a few bytes higher on the stack, exactly where the saved ebp and return address would be, which triggers the fault when executing the ret (I assume an x86-like stack architecture).

Running under gdb should make this fairly trivial to capture. Enter DoGetOutputsFromInputs and use watch to set a break-on-write on the stack slot holding the return address (see Can I set a breakpoint on 'memory access' in GDB?). Let it run; it should break when the overwrite occurs (if my hypothesis is correct), and that instruction is your culprit.

Of course, compiling with stack-smash protection would also capture the problem fairly easily, but where is the fun?
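
To make the hypothesis concrete, here is a minimal sketch of the mechanism, not the asker's actual code. The types `small_outputs` and `big_outputs`, the `CANARY` value and both functions are made up for illustration. The "stack frame" is simulated with a single byte array so that every write stays inside one object and the demo itself has defined behaviour; the canary stands in for the saved return address.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical types: what the caller allocated vs. what the callee assumes. */
struct small_outputs { char *SomeString; };
struct big_outputs   { char *SomeString; char extra[8]; };

#define CANARY 0xDEADBEEFu  /* stands in for the saved return address */

/* Callee that clears `len` bytes of the region it was handed. */
static void init_outputs(unsigned char *region, size_t len)
{
    memset(region, 0, len);
}

/* Returns 1 if the canary placed right after a small_outputs-sized region
 * survives a callee that writes `callee_size` bytes, 0 if it was clobbered,
 * and -1 if `callee_size` would not fit in the simulated frame. */
int canary_survives(size_t callee_size)
{
    /* Simulated frame: outputs region, then the canary, plus enough slack
     * that even the oversized write stays inside this one array. */
    unsigned char frame[sizeof(struct big_outputs) + sizeof(uint32_t)];
    const uint32_t canary = CANARY;
    uint32_t after;

    if (callee_size > sizeof(struct big_outputs))
        return -1;

    memcpy(frame + sizeof(struct small_outputs), &canary, sizeof canary);
    init_outputs(frame, callee_size);
    memcpy(&after, frame + sizeof(struct small_outputs), sizeof after);

    return after == canary;
}
```

Calling `canary_survives(sizeof(struct small_outputs))` leaves the canary intact, while `canary_survives(sizeof(struct big_outputs))` clobbers it. On a real stack the clobbered bytes would be whatever sits above the local, which is why the crash only shows up at the return.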

Remus Rusanu

Well to answer my own question and close this off / avoid wasting anyone's time... basically, it's not the malloc, it's unlikely it's even that function, there is something lurking in the code which isn't quite right and which I will have to devote a fair bit more time & coffee to tracking down.

Thanks all for the input.

Nurse, fetch the valium!

John U

It's impossible to say without the actual code, but this could be due to memory corruption (e.g., a buffer overflow or underflow) or UB (undefined behavior). If it is, chances are the actual issue is happening somewhere else and just happens to show up at this point.

A few things you can do to narrow down the cause:

  • Use Valgrind or a similar tool to look for memory issues.
  • Create a minimal example code that replicates the issue.
  • Double-check all memory allocations, frees, and copies.
  • Test DoGetOutputsFromInputs() in isolation to ensure it works as expected.
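
As a rough sketch of the "minimal example" and "test in isolation" steps, something like the following could be a starting point. `FillOutputs` is a hypothetical stand-in reduced from the question's pseudocode, not the real gSoap-generated function, and `CheckFillOutputs` is an invented helper name.

```c
#include <stdlib.h>
#include <string.h>

struct outputs_struct {
    char *SomeString;
};

/* Hypothetical stand-in for the function under suspicion: allocates and
 * fills SomeString. Returns 0 on success, -1 on a NULL argument or a
 * failed allocation. */
int FillOutputs(struct outputs_struct *Outputs)
{
    static const char text[] = "Hello,world";

    if (Outputs == NULL)
        return -1;

    /* Size the buffer from the data itself, terminator included. */
    Outputs->SomeString = malloc(sizeof text);
    if (Outputs->SomeString == NULL)
        return -1;

    memcpy(Outputs->SomeString, text, sizeof text);
    return 0;
}

/* Runs the function on its own, checks the result, and frees the buffer.
 * Returns 0 if everything looks sane, -1 otherwise. */
int CheckFillOutputs(void)
{
    struct outputs_struct out = { NULL };
    int ok;

    if (FillOutputs(&out) != 0)
        return -1;

    ok = (strcmp(out.SomeString, "Hello,world") == 0);
    free(out.SomeString);  /* the pseudocode in the question never frees this */

    return ok ? 0 : -1;
}
```

If a reduced version like this passes on the target while the full build still crashes, that points away from the function and toward whatever runs around it.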
uesp