19

Yes, this question has been asked before, but reading the answers didn't enlighten me much.

I wrote a C program that crashes after a few days of use. An important point is that it does NOT generate a core file, even though everything is set up so that it should (core_pattern, ulimit -c unlimited, etc. I can trigger a core dump fine with kill -SIGQUIT).

The program logs extensively what it does, but there's no hint about the crash in the log. The only message displayed at the crash (or before?) is:

XIO:  fatal IO error 11 (Resource temporarily unavailable) on X server ":0"
  after 2322 requests (2322 known processed) with 0 events remaining.

So two questions:
- How is it possible for a program to crash (return $?=1) without a core dump?
- What is this error message about, and what can I do?

System is RedHat Enterprise 6.4

Edit: I managed to force a core dump by calling abort() from inside an atexit() callback:

(gdb) bt
#0  0x00bc8424 in __kernel_vsyscall ()
#1  0x0085a861 in raise () from /lib/libc.so.6
#2  0x0085c13a in abort () from /lib/libc.so.6
#3  0x0808f5cf in Unexpected () at MyCode.c:1378
#4  0x0085de9f in exit () from /lib/libc.so.6
#5  0x00c85701 in _XDefaultIOError () from /usr/lib/libX11.so.6
#6  0x00c85797 in _XIOError () from /usr/lib/libX11.so.6
#7  0x00c84055 in _XReply () from /usr/lib/libX11.so.6
#8  0x00c68b8f in XGetImage () from /usr/lib/libX11.so.6
#9  0x004fd6a7 in ?? () from /usr/local/lib/libcvi.so
#10 0x00478ad5 in ?? () from /usr/local/lib/libcvi.so
...
#29 0x001eed9d in ?? () from /usr/local/lib/libcvi.so
#30 0x001eee41 in RunUserInterface () from /usr/local/lib/libcvi.so
#31 0x0808fab4 in main (argc=2, argv=0xbfbdc984) at MyCode.c:1540

Can anyone enlighten me as to this X11 problem? libcvi.so is not mine, only MyCode.c (LabWindows/CVI).
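For anyone wanting to reproduce the atexit() trick mentioned in the edit above, here is a minimal sketch. The function and variable names are made up for illustration (the handler in the backtrace is called Unexpected()), and the "expected exit" flag is an addition so that a deliberate exit does not also dump core.

```c
#include <stdio.h>
#include <stdlib.h>

/* Set just before the program's own, intended exit. */
static int exit_is_expected = 0;

/* Runs during exit(); aborts only if nobody announced the exit. */
static void trap_unexpected_exit(void)
{
    if (!exit_is_expected) {
        fputs("unexpected exit(): aborting to force a core dump\n", stderr);
        abort();   /* raises SIGABRT; a core is written if core dumps are enabled */
    }
}

/* Call once early in main(). Returns 0 on success, like atexit(). */
int install_exit_trap(void)
{
    return atexit(trap_unexpected_exit);
}

/* Call right before a deliberate exit() or return from main(). */
void mark_expected_exit(void)
{
    exit_is_expected = 1;
}
```

With this in place, a library that calls exit() behind your back ends up in abort(), and the core file shows the full stack leading to the exit() call, as in the backtrace above.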

Edit 2014-12-05: Here's an even more precise backtrace. Things definitely happen in X11, but I'm no X11 programmer, so looking at the X source code at the lines given in the backtrace tells me only that the X server (?) is temporarily unavailable. Is there any way to simply tell it to ignore this error if it's only temporary?

#4  0x00965eaf in __run_exit_handlers (status=1) at exit.c:78
#5  exit (status=1) at exit.c:100
#6  0x00c356b1 in _XDefaultIOError (dpy=0x88aeb80) at XlibInt.c:1292
#7  0x00c35747 in _XIOError (dpy=0x88aeb80) at XlibInt.c:1498
#8  0x00c340a6 in _XReply (dpy=0x88aeb80, rep=0xbf82fa90, extra=0, discard=0) at xcb_io.c:708
#9  0x00c18c0f in XGetImage (dpy=0x88aeb80, d=27263845, x=0, y=0, width=60, height=20, plane_mask=4294967295, format=2) at GetImage.c:75
#10 0x005f46a7 in ?? () from /usr/local/lib/libcvi.so

Corresponding lines:

XlibInt.c: _XDefaultIOError()
1292:   exit(1);

XlibInt.c: _XIOError
1498:   _XDefaultIOError(dpy);

xcb_io.c: _XReply()
708:    if(!reply) _XIOError(dpy);

GetImage.c: XGetImage()
74: if (_XReply (dpy, (xReply *) &rep, 0, xFalse) == 0 || ...
xpt
dargaud
  • Your program may leak descriptors. Look in its `/proc/<pid>/fd` directory after it has been running for some time; do you see an increasing number of links in there? – n. m. could be an AI Sep 11 '14 at 19:14
  • It usually takes several days before it crashes, but I'll be monitoring the situation. Some googling led me to believe that it is a Xinerama/NVidia multi-monitor problem unrelated to my app. – dargaud Sep 12 '14 at 09:00
  • Removing Xinerama didn't help. I still get those crashes without a core dump. Is there any tool I can use to track it down? – dargaud Sep 16 '14 at 09:42
  • Note that core dumps are disabled by default. You'd have to enable them by e.g. running `ulimit -c unlimited` in the same shell you launch the application from (or do it programmatically from within the application with a setrlimit(RLIMIT_CORE, ... ) call) – nos Oct 08 '14 at 13:43
  • Yes, I know about ulimit, but that doesn't help. It seems the program quits via a call to exit(0) in some library. I just found out that I can catch it with atexit(). I placed a call to abort() inside and now I'm waiting for it to quit again (takes a few days). – dargaud Oct 09 '14 at 12:14
  • I just added a backtrace to the original post – dargaud Oct 15 '14 at 07:35
  • I am having similar problems. Did you ever find a solution to this? – Azmisov Jan 06 '15 at 01:32
  • Nope, no solution yet. Are you having this problem with CVI or some other system? I'd like to know more. – dargaud Jan 06 '15 at 13:08

2 Answers

11

OK, I finally found the cause (thanks to someone at National Instruments), a better diagnostic and a workaround.

The bug is in many versions of libxcb and is a 32-bit counter rollover problem that has been known for a few years: https://bugs.freedesktop.org/show_bug.cgi?id=71338

Not all versions of libxcb are affected: libxcb-1.9-5 has the bug, libxcb-1.5-1 doesn't. According to the bug report, 64-bit OSes shouldn't be affected, but I managed to trigger it on at least one version.

Which brings me to a better diagnostic. The following program will crash in less than 15 minutes on affected libraries (better than the entire week it previously took):

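// Compile with: gcc test.c -lX11 && time ./a.out
#include <X11/Xlib.h>

int main(void) {
    Display *d = XOpenDisplay(NULL);  // connect to the display named by $DISPLAY
    if (!d)
        return 1;                     // could not connect to the X server
    for (;;)
        XNoOp(d);                     // each call queues one more request
}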

And one final thing: the above program compiled and run on a 64-bit system works fine, and compiled and run on an old 32-bit system it also works fine, but if I transfer the 32-bit binary to the 64-bit system, it crashes after a few minutes.
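To illustrate what a 32-bit counter rollover does in general, here is a small sketch. This is not libxcb's code, and the function names are invented; it only shows how a naive comparison of sequence numbers goes wrong once the counter wraps from 0xFFFFFFFF back to 0, and one common wrap-aware alternative.

```c
#include <stdbool.h>
#include <stdint.h>

/* Naive ordering test: gives the wrong answer once the counter has
 * wrapped past UINT32_MAX back to 0. */
bool seq_before_naive(uint32_t a, uint32_t b)
{
    return a < b;
}

/* Wrap-aware ordering test: correct as long as the two values are less
 * than 2^31 apart. The unsigned-to-signed conversion is
 * implementation-defined in C; gcc wraps modulo 2^32. */
bool seq_before_wrapsafe(uint32_t a, uint32_t b)
{
    return (int32_t)(a - b) < 0;
}
```

With a = 0xFFFFFFFF and b = 0 (the next value after the wrap), the naive test says a does not come before b, while the wrap-aware test says it does.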

dargaud
4

I just had a program that acted exactly like this, with exactly the same error message. With the counter bug, though, I would expect about 2^32 requests to be processed before the crash.

The program was structured so that a worker thread has a separate X connection to the X thread so that it can send messages to the X thread to update the window.

In the end I traced the problem down to a place where the function that sends the redraw events to the window was called by multiple threads without a mutex around it. Since Xlib calls on the same X connection are not re-entrant, the program crashed with this exact error. I put a mutex around the function and have had no problems since.
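A minimal sketch of that kind of fix, assuming POSIX threads. The names are hypothetical, and send_redraw_stub only stands in for the real function that talks to the X connection; every thread would go through with_x_lock instead of calling that function directly.

```c
#include <pthread.h>

/* One lock guarding all use of the shared X connection. */
static pthread_mutex_t x_lock = PTHREAD_MUTEX_INITIALIZER;

/* Runs fn(arg) while holding the lock, so only one thread at a time
 * touches the connection. */
void with_x_lock(void (*fn)(void *), void *arg)
{
    pthread_mutex_lock(&x_lock);
    fn(arg);
    pthread_mutex_unlock(&x_lock);
}

/* Placeholder for the real redraw function; it just counts calls. */
void send_redraw_stub(void *arg)
{
    int *calls = arg;
    ++*calls;
}
```

Xlib also has its own mechanism for this, XInitThreads(), which has to be called before any other Xlib function; whether that is enough depends on how the program uses the connection.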

camelccc