
I have a set of single threaded C++ mains that have a bulk of their code developed in Ada. This is being built in Atego (Rational) Apex Duo, and targets 32 bit RHEL 6.3 Linux. The exec is a Class system I developed that includes sockets, a state machine, and a timer class that is the heart of the exec. The class system is used to build and execute 14 separate execs on 6 different systems that communicate via sockets. They all use the same Class system and configure themselves at startup based on an INI file.

The execs frame at either 50 or 60 Hz using the Linux system clock via gettimeofday

struct timeval {
    time_t      tv_sec;     /* seconds */
    suseconds_t tv_usec;    /* microseconds */
};

and a simple non-busy-wait algorithm to produce the desired scheduler.
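A minimal sketch of such a gettimeofday-based, non-busy-wait frame scheduler (hypothetical names and structure, not the poster's actual code) might look like:

```cpp
#include <sys/time.h>   // gettimeofday, struct timeval
#include <unistd.h>     // usleep
#include <stdint.h>

// Microseconds elapsed between two timevals (assumes b is not before a).
static int64_t elapsed_us(const timeval& a, const timeval& b)
{
    return (int64_t)(b.tv_sec - a.tv_sec) * 1000000 + (b.tv_usec - a.tv_usec);
}

// Sleep out the remainder of the current frame, then mark the next one.
// period_us would be 20000 for 50 Hz or 16667 for 60 Hz.
static void wait_for_next_frame(timeval& frame_start, long period_us)
{
    timeval now;
    gettimeofday(&now, 0);
    int64_t used = elapsed_us(frame_start, now);
    if (used < period_us)
        usleep(period_us - used);   // non-busy wait for the rest of the frame
    gettimeofday(&frame_start, 0);  // next frame starts now
}
```

The main loop would then call its rate-group procedures and finish each iteration with `wait_for_next_frame`.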

The problem I am facing at the moment is that these execs fail (seemingly) randomly. The failure appears to be that they simply stop framing. I have encapsulated all of the runtime C++ procedures in "try{} catch(...){}" and nothing is caught. Likewise, the Ada routines are protected with Exception Handlers that do not get hit.

The exec will often run happily for 30 minutes to over an hour before failure. No indication of memory creep is evident (using System Monitor). I have attached the Atego Rational Graphical Debugger to the already running execs, to no avail. When one finally fails, there is nothing in the call stack and the only indication in the debugger log is that the application has "Exited with status 255".

I am afraid that a (Linux) system routine or driver somewhere in the system is calling Exit. It seems apparent that SOMETHING is calling exit!

Anyone have any idea how I can further debug this problem?

  • Last time I had an issue like this I was using a library with a retain/release mechanism and I released a constant by mistake. It gradually lowered in retain count until it was freed, and an error occurred after several hours of running. I mention this because the question as you've posed it doesn't seem to give enough information to attempt an answer, so that's the best I've got. – Dave Mar 23 '13 at 02:45
  • Are the execs symmetric, or all different?. Do they all fail or only some of them? –  Mar 23 '13 at 12:19
  • have you tried to increase the coredumpsize? – egilhh Mar 23 '13 at 13:13
  • Is there a particular reason you're using the C++ timing functions? If your C++ main is a harness-type program then would it be too much work to write an analogous one in Ada and try that? {For purposes of excluding sections of code; if it persists then it's very likely something on the Ada-side, if it vanishes it's something on the C++ side.} – Shark8 Mar 23 '13 at 16:49
  • You could try [wrapping `exit()`](http://stackoverflow.com/questions/3662856/how-to-reimplement-or-wrap-a-syscall-function-in-linux) with something to log a stack trace. – Simon Wright Mar 23 '13 at 20:21
  • egilhh, the execs aren't producing a core dump. It appears that exit is being called from within somewhere... – raetza Mar 23 '13 at 21:54
  • Shark8, the code has a lot of I/O in terms of unicast and multicast sockets. For the purpose of applying class systems to the implementation, it was far easier for me to use C++ for the main – raetza Mar 23 '13 at 21:55
  • Simon, thanks...sounds like a good suggestion since I am currently getting no trace. I had already decided to try using atexit() for that same purpose. – raetza Mar 23 '13 at 22:00
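Simon's suggestion of trapping exit() with a logged stack trace can be sketched as follows. This is a hypothetical, glibc-specific illustration (backtrace and backtrace_symbols_fd are glibc extensions, available on RHEL); since atexit handlers run inside exit(), whatever called exit() is still on the stack when the handler fires:

```cpp
#include <execinfo.h>  // backtrace, backtrace_symbols_fd (glibc)
#include <cstdlib>     // std::atexit
#include <unistd.h>    // STDERR_FILENO

// atexit handler: dump the call stack that is live when exit() runs.
static void dump_stack_on_exit()
{
    void* frames[64];
    int n = backtrace(frames, 64);
    backtrace_symbols_fd(frames, n, STDERR_FILENO);  // async-signal-safe output
}

// Call once early in main(); returns 0 on success, as atexit() itself does.
static int install_exit_trace()
{
    return std::atexit(dump_stack_on_exit);
}
```

Symbol names in the trace will only resolve usefully if the executable is linked with `-rdynamic`.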

3 Answers


This should probably not be an answer but it's too wide ranging for a comment...

So the failures are in a single executable on each of 6 machines; the executable that is responsible for sharing data across the network; right? And the "local" executables seem to be reliable... Or do the 6 faulty ones not map onto the 6 systems so cleanly?

Is failure related to network loading, e.g. latency exceeding your (TV or AC mains) frame rate? Making it fail faster by clogging the network can simplify testing...

I had a system fail when the Linux network clock ran backwards... the failure was in a C++ component so no easy Constraint Error when dt went negative, but an absurd "timeout at 4e9 microseconds" far down the line from the failure...
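As an illustration of that failure mode (hypothetical code, not the system described), a defensive elapsed-time computation that tolerates the clock stepping backwards looks like this:

```cpp
#include <sys/time.h>   // struct timeval
#include <stdint.h>

// Elapsed microseconds, clamped to zero if the system clock stepped
// backwards (e.g. an NTP adjustment). Without the clamp, an unsigned
// consumer of dt sees an absurd multi-billion-microsecond "timeout".
static int64_t safe_elapsed_us(const timeval& earlier, const timeval& later)
{
    int64_t dt = (int64_t)(later.tv_sec - earlier.tv_sec) * 1000000
               + (later.tv_usec - earlier.tv_usec);
    return dt < 0 ? 0 : dt;
}
```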

It sounds as though Ada's tasking facilities and the Distributed Systems Annex would be ideal for this application, but that level of design change is probably not appropriate at this stage.

  • Brian, The 6 execs that fail do map to 6 machines...and they are the execs responsible for I/O (UDP, Multicast, MIL-STD-1553). The execs are completely single threaded and execute as if they were run via an RTOS (they are ported from VxWorks on PPC to Linux on x86). The C++ mains were my choice as it's far easier for me to develop in C++ than in Ada...especially when doing a lot of socket stuff. – raetza Mar 26 '13 at 00:22
  • I now believe the problem is being trapped by the Ada RTS, it is generating the exit() call via a panic. I learned this via the Call Stack by trapping exit() in the main with an atexit() and setting a breakpoint on it. However, I am still searching for the cause of the problem...what error is causing the Ada RTS to call exit()? My feeling is that it is in the Ada code somewhere...or something odd like Stack going out of bounds. – raetza Mar 26 '13 at 00:28
  • Panic and exit; doesn't sound like something an Ada RTS would do. Unless it's raising an exception to be handled by the main subprogram which (not being Ada) doesn't recognise the exception. Is all the Ada code wrapped in a "begin ... do stuff ... exception ... print stuff ... end" layer? Alternatively, would it be too hard to create an Ada main for ONE of the failing tasks? –  Mar 26 '13 at 10:02
  • addendum to prev comment re panic and exit : provided the standard runtime checks are on. Is it possible the execs have been built without them? –  Mar 26 '13 at 12:48
  • Brian, via Atego tech support we have determined that the Ada RTS contains a symbol called panic_exit_application_no_ct. This is commonly caused by a Task deadlock, but they say it can also be from other causes (as must me our case). This is appearing in the debugger call stack with the call to exit(), so we know the exit() is being called from the Ada RTS...we just have no idea yet why. – raetza Mar 26 '13 at 16:05
  • The Ada cyclic run calls a large set of procedures, based on access to them via pointers to them from a queue system. This queue system has multiple rate groups so that during a particular call to the run cycle, only the set of procedures appropriate for the current rate group are called from the queues. The failures always occur during execution of a queued procedure...but not the same one :(. It appears random...but we are looking for non-randomness. – raetza Mar 26 '13 at 16:08
  • Okay, I have no experience of the Atego RTS. Probably my last (good or otherwise :-) Q : any sign of failure correlating with high network load? –  Mar 26 '13 at 16:28
  • Brian, no...certain of this on two fronts. 1) The network load is not near any bandwidth limitation and we are not dropping packets, and 2) The failures occur in the same manner if I shut off all network send/receives. – raetza Mar 26 '13 at 16:44
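The rate-group queue dispatch raetza describes might look roughly like this (illustrative names only; the real system is in Ada with the queues driven from the cyclic run):

```cpp
#include <vector>
#include <cstddef>

typedef void (*Proc)();  // queued procedures are reached through pointers

// One queue of procedure pointers per rate group.
struct RateGroup {
    unsigned divisor;          // run every Nth frame (1 = every frame)
    std::vector<Proc> queue;   // procedures belonging to this group
};

// Each frame, run only the groups whose rate divides the frame count.
static void run_frame(unsigned long frame, std::vector<RateGroup>& groups)
{
    for (size_t g = 0; g < groups.size(); ++g)
        if (frame % groups[g].divisor == 0)
            for (size_t i = 0; i < groups[g].queue.size(); ++i)
                groups[g].queue[i]();   // the reported failures occur in here
}

// Hypothetical queued procedures, just counting calls for illustration.
static int fast_count = 0, slow_count = 0;
static void fast_proc() { ++fast_count; }
static void slow_proc() { ++slow_count; }
```

A dispatch like this makes the "random procedure" symptom unsurprising: a corrupted stack or a stale procedure pointer fails in whichever entry happens to run that frame.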

Maybe this can give you some further direction: if you return a double from main, this can happen. See:

double main ()
{
    return 0.0;
}
$ cc double.c
double.c: In function 'main':
double.c:2: warning: return type of 'main' is not 'int'
$ ./a.out
$ echo $?
255
$

If, under a POSIX OS, you attempt to return a double from main(), the calling process will almost always see an exit status of 255: the double never sets the integer return register, so the value there is indeterminate, and the shell reports only its low 8 bits.

  • Brian, good question. Actually only 6 of the 14 execs exhibit this behavior. These 6 all have I/O associated with them. The I/O is between boxes (separate systems) and consists of UDP and Multicast Ethernet socket data via NICs in the PCs and MIL-STD-1553 data via 1553 cards in the PCs. The remaining 8 execs run with other execs and only have internal (Shared Memory) data exchanges. These 8 seem to run forever without fail. Also, all of the execs are different...they just have in common the Class systems I developed for socket I/O, INI reading, Timing and State Machine Exec. – raetza Mar 23 '13 at 12:57

It now appears that the Ada RTS was not "happy" with the C++ main. Certainly not all is answered...but it is fixed and working. We changed from C++ mains to Ada mains...which was a simple change...where the Ada now imports a bunch of C++ rather than the other way around. It's not as pristine as the C++ main...but functional trumps pretty, I guess.

The unanswered question is why did it die in the first place?...and why only some of the execs...while others went on running "forever"? We suspect it has to do with something going out of bounds, eventually, and the more that an application had to do in calls to routines in Ada...the quicker it died. This leaves us believing that the Stack was being corrupted...but why only with the C++ main?

Regardless, it works with the Ada mains and working is the overriding requirement...so we're calling it "fixed".
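For anyone attempting the same inversion, the C++ side of such an arrangement is typically exposed through extern "C" wrappers that an Ada main can bind to with pragma Import (C, ...). The names and INI check below are made up for illustration:

```cpp
// C++ side: wrap the class-based code behind a C-linkage surface so an
// Ada main can call it. Ada side (sketch):
//   procedure Exec_Run_One_Frame;
//   pragma Import (C, Exec_Run_One_Frame, "exec_run_one_frame");
#include <cstring>

extern "C" int exec_startup(const char* ini_path)
{
    // ... construct the class system, read the INI, open sockets ...
    // Hypothetical check standing in for real initialization; 0 = OK.
    return (ini_path && std::strlen(ini_path) > 0) ? 0 : -1;
}

extern "C" void exec_run_one_frame()
{
    // ... one pass of the cyclic executive: dispatch rate groups, wait ...
}
```

With an Ada main, unhandled Ada exceptions also propagate to an Ada outermost frame, so the RTS no longer has to panic-exit through foreign frames.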
