I have a section of code which runs in the ROS runloop with 3 lambdas: two for modifying flags and one for invoking the ROS runloop until topics hit a particular condition. It is essentially close to:
bool wait_until_engine_started()
{
    bool engine_started_state = false;
    bool engine_started_event = false;

    // Pseudocode: a subscription wrapper (call elided) invokes the lambda below
    auto engine_state_checker = /* subscription wrapper */ (
        [&](const EngineRosMessage::ConstPtr& msg) {
            // Pseudocode for a validation function
            if (engine_appears_to_be_on(msg))
            {
                // Modify the referenced boolean here
                engine_started_state = true;
            }
            // Return to runloop
        }
    );

    // Pseudocode: a subscription wrapper (call elided) invokes the lambda below
    auto engine_code_checker = /* subscription wrapper */ (
        [&](const DifferentEngineRosMessage::ConstPtr& msg) {
            // Pseudocode for a validation function
            if (engine_appears_to_be_on_via_different_method(msg))
            {
                // Modify the referenced boolean here
                engine_started_event = true;
            }
            // Return to runloop
        }
    );

    // Spin until both flags have been set by the callbacks above
    return utils::spin_until_condition([&]() {
        return engine_started_state && engine_started_event;
    });
}
With
bool spin_until_condition(std::function<bool()> condition)
{
    while (ros::ok() && !condition())
    {
        ros::spinOnce();
    }
    return ros::ok();
}
I am hitting a segfault in some cases inside the lambda passed to spin_until_condition, but only when specific sections of code unrelated to this section are included.
Probing in GDB shows that on my machine:
- at the level of the engine_started_event declaration, the address of engine_started_event is 0x7fffffffc3ff
- inside the lambda engine_code_checker, the address of engine_started_event is 0x7fffffffc3ff
- inside the rvalue lambda in spin_until_condition, the address of engine_started_event is originally 0x7fffffffc3ff, but after engine_started_event = true it moves to 0x1007fffffffc3ff, at which point the segfault occurs
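Comparing the two addresses byte by byte may help narrow this down. The sketch below checks which byte of the pointer value changed and what it changed to. The helper names are hypothetical, and it assumes a little-endian x86-64 target, which the 0x7fff... stack addresses suggest but which is not confirmed.

```cpp
#include <cassert>
#include <cstdint>

// Index (0 = least significant) of the single byte that differs between two
// 64-bit values, or -1 if no byte or more than one byte differs.
int single_changed_byte(std::uint64_t before, std::uint64_t after)
{
    const std::uint64_t diff = before ^ after;
    int found = -1;
    for (int i = 0; i < 8; ++i)
    {
        if (((diff >> (8 * i)) & 0xFFu) != 0)
        {
            if (found != -1)
            {
                return -1;  // more than one byte changed
            }
            found = i;
        }
    }
    return found;
}

// Value of byte `index` (0 = least significant) of `value`.
std::uint8_t byte_at(std::uint64_t value, int index)
{
    return static_cast<std::uint8_t>((value >> (8 * index)) & 0xFFu);
}

// Checks the two addresses reported by GDB above.
void check_observed_addresses()
{
    const std::uint64_t before = 0x7fffffffc3ffULL;  // also the bool's own address
    const std::uint64_t after = 0x1007fffffffc3ffULL;
    const int changed = single_changed_byte(before, after);
    assert(changed == 7);                     // only the top byte differs
    assert(byte_at(after, changed) == 0x01);  // and it now holds 1
    // Assumption: little-endian, so byte 7 of a pointer stored at address A
    // sits at A + 7. A + 7 equals the bool's address when A is:
    assert(before - 7 == 0x7fffffffc3f8ULL);
}
```

With the values above, only the most significant byte differs and its new value is 0x01. That is what a one-byte store of true would leave behind if the 8 bytes holding the captured pointer started at 0x7fffffffc3f8, overlapping the boolean at 0x7fffffffc3ff. This is only one reading consistent with the numbers, not a confirmed cause.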
This behaviour is very reliably disabled by removing a particular block of code that is unrelated to this block. Furthermore, the above section of code is traversed twice, once before the problem-causing block and once after, with the issue occurring only on the second round.
AFAIK there is no reason a reference should ever change its address, and the reliability of removing the problem block makes me think it's responsible, but I can't see how they would be affected given that the booleans and the 3rd lambda are stack-allocated variables.
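As a point of reference for the reasoning above: the standard leaves it unspecified whether a reference capture occupies storage, but compilers typically lower a [&] lambda to a struct with one pointer-sized slot per captured variable. The hand-written equivalent below is a sketch of that lowering for the condition lambda. The name ConditionClosure is hypothetical and the layout is an assumption about typical implementations.

```cpp
// Hand-written stand-in for the closure a compiler may generate for
//   [&]() { return engine_started_state && engine_started_event; }
struct ConditionClosure
{
    bool& engine_started_state;  // typically stored as a hidden pointer
    bool& engine_started_event;  // typically stored as a hidden pointer

    bool operator()() const
    {
        return engine_started_state && engine_started_event;
    }
};
```

Under that assumption the boolean itself never moves, but the slot recording where it lives is ordinary memory, so a stray write into the closure (or into a std::function holding a copy of it) would make the reference appear to change address.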
I'm running this on gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.11). Running this in gcc-7 did not cause a segfault, which makes me suspect a compiler whoopsie. But I've learnt time and again that the compiler is usually pretty good at its job, and the fact that removing our code removes the issue seems to point strongly at our code. My guess now is that a bad memory write in the unrelated code somehow causes the reference to change.
Valgrind also did not show anything apart from the actual segfaulting access at 0x1007fffffffc3ff.
So, the TL;DR:
- How can the address of a lambda's reference capture change in the way shown above (including the resulting bad memory access)?
- Are there any sensible ways of debugging this sort of situation so that I can catch the offending code writing to where this reference lives?
- Or is this a compiler whoopsie?
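For the second question, one low-tech option is to snapshot the raw bytes of the object holding the captures and compare them after every spin, so the first iteration in which they change can be pinned to a specific callback. The helpers below are a minimal sketch with hypothetical names. They only read the object representation and assume the object has no padding bytes that could change on their own.

```cpp
#include <array>
#include <cstring>

// Copies the raw bytes of `object` so they can be compared later.
template <typename T>
std::array<unsigned char, sizeof(T)> snapshot_bytes(const T& object)
{
    std::array<unsigned char, sizeof(T)> bytes;
    std::memcpy(bytes.data(), &object, sizeof(T));
    return bytes;
}

// True if `object` still has exactly the bytes recorded in `saved`.
template <typename T>
bool bytes_unchanged(const T& object,
                     const std::array<unsigned char, sizeof(T)>& saved)
{
    return std::memcmp(saved.data(), &object, sizeof(T)) == 0;
}
```

Inside spin_until_condition, this could mean taking a snapshot of condition before the loop and checking bytes_unchanged(condition, saved) after each ros::spinOnce(). A hardware watchpoint in GDB on the address where the captured pointer is stored serves the same purpose without touching the code.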