I have an issue with my code that has some very strange symptoms.

  1. The code is compiled on my computer with the following versions:

    a. GCC Version: 4.4.2

    b. CMake version: 2.8.7

    c. QNX (operating system) version: 6.5.0

The code segfaults while freeing some memory on exit from a function (it does not die on any particular statement, only as the function returns).

The weird things about this are:

  1. The code does it in release mode but not debug mode:

    a. The code is threaded so this indicates a race condition.

    b. I cannot debug by putting it in debug mode.

  2. The code, when compiled on a workmate's machine with the same versions of everything, does not have this problem.

    a. The weird thing here is not only that the workmate's build works, but also that the binary created by compiling on his machine (which is the same) is about 6 MB bigger.

Annoyingly, I cannot post the code because it is too big and because it belongs to my employer. But can anyone point me along a path to fixing this?

Since I am using QNX I am limited in my debug tools: I cannot use Valgrind since it is not supported on QNX, and GDB doesn't really help.

I am looking for anyone who has had a similar/same problem and what the cause was and how they fixed it.

EDIT:

Sooo... I found out what it was, but I'm still a bit confused about how it happened.

The culprit code was this:

Eigen::VectorXd msBb = data.modelSearcher->getMinimumBoundingBox();

where the definition for getMinimumBoundingBox is this:

Eigen::VectorXd ModelSearcher::getMinimumBoundingBox();

and it returns a VectorXd which is always initialised as VectorXd output(6, 1). So I immediately thought, right, it must be because the VectorXd is not being initialised, and I tried changing the call to this:

Eigen::VectorXd msBb(6, 1); msBb = data.modelSearcher->getMinimumBoundingBox();

But this didn't work. In fact I had to fix it by changing the definition of the function to this:

void ModelSearcher::getMinimumBoundingBox(Eigen::MatrixXd& input);

and the call to this:

Eigen::VectorXd msBb(6, 1); data.modelSearcher->getMinimumBoundingBox(msBb);

So now the new question:

What the hell? Why didn't the first change work when the second did? Why do I have to pass by reference? Oh, and the big question: how the hell didn't this break when my co-worker compiled it and I ran it? It's a straight-out memory error; surely it shouldn't depend on which computer compiles it, especially since the compiler and all the other important things are the same!?

Thanks for your help guys.

Fantastic Mr Fox
  • Sounds like a memory issue - valgrind in debug mode is still worth a try - it will help decide if it is memory or not. – John3136 Nov 16 '12 at 01:23
  • "1.The code does it in release mode but not debug mode:" - that is a not uncommon symptom of a memory issue (due to footprint differences) – Mitch Wheat Nov 16 '12 at 01:27
  • @John3136 Unfortunately I can't use valgrind because I am running on QNX, which has no support for it. I can try to move the function out into Linux (Ubuntu or Fedora), but that will take ages so I want to try other things first. – Fantastic Mr Fox Nov 16 '12 at 01:29
  • Look for an uninitialized pointer. In debug mode generally variables get zeroed out. Since it is happening with multi-threading it may be a pointer that normally gets set, but due to a race condition is left with its initial value and thus when it is freed/deleted causes a segfault. – pstrjds Nov 16 '12 at 01:29
  • Typically optimizations and symbols are different compiler flags. Have you tried compiling with optimizations on in debug? Have you used a diff program on the two binaries? Have you tried moving the binary between systems? Have you checked what is different about the two systems? Is it consistent? – Yakk - Adam Nevraumont Nov 16 '12 at 01:30
  • @Yakk The systems are essentially the same; I don't think diff on a binary file will be very useful? The binary from my workmate's computer works on all the computers we have tried, and the one from mine works on none of them. I'm pretty sure all of the compile flags are set the same, but I will check. – Fantastic Mr Fox Nov 16 '12 at 01:32
  • Shouldn't you guys add these as answers? They are all good advice and I would happily upvote them. Since the question is pretty broad and I can't provide code, these would pass as answers and not just comments. – Fantastic Mr Fox Nov 16 '12 at 01:36
  • For the first fix, you are initializing something and then immediately replacing its value via assignment, so it's not surprising that this had no effect. (I don't have enough information to answer the rest of your question.) – jdigital Nov 16 '12 at 18:27
  • Your workmate's machine might have different libraries. Have you compared your tool installation with his? – jdigital Nov 16 '12 at 01:58
  • I have not, how do I do that? – Fantastic Mr Fox Nov 16 '12 at 02:16
  • You could start with a tree compare (for example, WinMerge http://winmerge.org/). For binaries, it will show files that are different (but won't actually do a diff, which is fine). If you can't access his machine over the network, just zip up everything on one machine and unzip it on the other. – jdigital Nov 16 '12 at 03:28
  • Sounds like you didn't follow the rule of three, and that copies are breaking the object. – Lightness Races in Orbit Nov 23 '12 at 10:27

2 Answers


... the binary created from compiling on his machine, which is the same, is about 6mB bigger

It's worth figuring out what the difference is (even if it's just the case that his build hides, while yours exposes, a real bug):

  • double-check you're compiling exactly the same code (no un-committed local changes, no extra headers in the include search path, etc.)
    • triple-check by adding a -E switch to your gcc arguments in cmake, so it will pre-process your files with the same include path as regular compilation; diff the pre-processor output
  • compare output from nm or objdump or whatever you have to for your two linked executables: if some system or 3rd-party library is a different version on one box, it may show up here
  • compare output from ldd if it's dynamically linked, make sure they're both getting the same library versions
    • compare the library versions it actually gets at runtime too, if possible. Hopefully you can do one of: run pldd, compare the .so entries in /proc/pid/map, run the process under strace/dtrace/truss and compare the runtime linker activity

As for the code ... if this doesn't work:

Eigen::VectorXd ModelSearcher::getMinimumBoundingBox();
// ...
Eigen::VectorXd msBb(6, 1); msBb = data.modelSearcher->getMinimumBoundingBox();

and this does:

void ModelSearcher::getMinimumBoundingBox(Eigen::MatrixXd& input);
// ...
Eigen::VectorXd msBb(6, 1); data.modelSearcher->getMinimumBoundingBox(msBb);

you presumably have a problem with the assignment operator. If it does a shallow copy and there is dynamically-allocated memory in the vector, you'll end up with two vectors holding the same pointer, and they'll both free/delete it.

Note that if the operator isn't defined at all, the default is to do this shallow copy.
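To make the shallow-copy hazard concrete, here is a minimal sketch. `ShallowVec` and `DeepVec` are hypothetical stand-ins invented for illustration, not Eigen types, and nothing here claims to show how Eigen itself copies. It only shows what the compiler-generated copy does with a raw owning pointer, and what a class that follows the rule of three does instead.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical type relying on the compiler-generated copy operations.
// It deliberately has no destructor: adding `delete[] data;` here would
// make every copy free the same block, which is the double free described above.
struct ShallowVec {
    double* data;
    std::size_t size;
};

// Hypothetical type following the rule of three: copy constructor,
// copy assignment and destructor all manage the owned buffer.
class DeepVec {
public:
    explicit DeepVec(std::size_t n) : data_(new double[n]()), size_(n) {}
    DeepVec(const DeepVec& other)
        : data_(new double[other.size_]), size_(other.size_) {
        std::copy(other.data_, other.data_ + size_, data_);
    }
    DeepVec& operator=(const DeepVec& other) {
        if (this != &other) {
            // Allocate and fill the new buffer before releasing the old one.
            double* fresh = new double[other.size_];
            std::copy(other.data_, other.data_ + other.size_, fresh);
            delete[] data_;
            data_ = fresh;
            size_ = other.size_;
        }
        return *this;
    }
    ~DeepVec() { delete[] data_; }
    double* data() { return data_; }
    std::size_t size() const { return size_; }
private:
    double* data_;
    std::size_t size_;
};

// The default copy duplicates the pointer, so both objects share one buffer.
bool shallow_copy_aliases() {
    double* block = new double[6]();
    ShallowVec a = {block, 6};
    ShallowVec b = a;                 // member-wise (shallow) copy
    assert(b.size == a.size);
    bool aliased = (a.data == b.data);
    delete[] block;                   // freed exactly once, by hand
    return aliased;
}

// With the rule of three, assignment gives each object its own buffer.
bool deep_copy_is_independent() {
    DeepVec a(6);
    a.data()[0] = 1.0;
    DeepVec b(6);
    b = a;
    b.data()[0] = 2.0;
    return a.data() != b.data() && a.data()[0] == 1.0 && b.data()[0] == 2.0;
}
```

If a class in your own code base looks like `ShallowVec` but also frees its pointer in a destructor, any copy (return by value, assignment, pass by value) is a candidate for the crash on function exit.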

Useless

You said you had to change it to:

void ModelSearcher::getMinimumBoundingBox(Eigen::MatrixXd& input);

What was it before?

If it was:

void ModelSearcher::getMinimumBoundingBox(Eigen::MatrixXd input);

and the copy constructor or assignment operator wasn't implemented properly, that might have caused the problem.

Please do check how they are both implemented. Here's some info that might help.
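As a rough illustration of why the by-value and by-reference signatures differ, the sketch below counts copy-constructor calls. `Tracked`, `by_value` and `by_reference` are hypothetical names made up for this example; they are not part of Eigen or of the code in the question.

```cpp
// Hypothetical type that counts how often it is copy-constructed.
struct Tracked {
    static int copies;
    Tracked() {}
    Tracked(const Tracked&) { ++copies; }
};
int Tracked::copies = 0;

// Taking the parameter by value copy-constructs it from the caller's object.
void by_value(Tracked) {}

// Taking it by reference binds to the caller's object; no copy is made.
void by_reference(Tracked&) {}

int copies_made_by_value() {
    Tracked t;
    int before = Tracked::copies;
    by_value(t);
    return Tracked::copies - before;
}

int copies_made_by_reference() {
    Tracked t;
    int before = Tracked::copies;
    by_reference(t);
    return Tracked::copies - before;
}
```

So a broken copy constructor only gets a chance to run in the by-value version; switching to a reference sidesteps it rather than fixing it.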

serengeor