My company sells Linux-based devices that run a number of executables. One of these applications hangs every few days in the newest version of our product.

We are using glibc 2.19 and gcc 4.8.3 and Linux kernel version 3.16.38. We are building for x86_64.

Our glibc version is very old, and we supposedly patched it a year ago with the fix for: Bug #12926: getaddrinfo()/make_request() spins forever (https://sourceware.org/bugzilla/show_bug.cgi?id=12926)

The maintainer of our crosstool swears that the one we are using has a patched glibc. However, there are other ways this could fail; for example, our builds may be picking up a different glibc for some reason.

On our build machine, we save away unstripped versions of our application's executables and shared object binaries that we can later use when debugging core files.

I have generated a few core files by logging into a device with a hung application and sending the process a SIGILL.

The core files appear to show that we are experiencing a hang in getaddrinfo() and the stack traces look like the ones we used to get before patching glibc. Example from a recent core file using the latest deployed build:

Thread #18 1456 (Suspended : Container)
recvmsg() at 0x7f1fa276c17d
make_request() at 0x7f1fa278695d
__check_pf() at 0x7f1fa2786e54
getaddrinfo() at 0x7f1fa2759501

Thread #16 1454 (Suspended : Container)
__lll_lock_wait_private() at 0x7f1fa277777b
_L_lock_443() at 0x7f1fa2786f4d
__check_pf() at 0x7f1fa2786d05
getaddrinfo() at 0x7f1fa2759501

I would like to be able to verify which version of getaddrinfo() the release executables we have deployed are executing: patched or unpatched. Doing this on my personal development box won't help because that would only verify my own toolchain / build environment. Is there any way I can do this with the release binaries we have deployed?

EDIT: I forgot to mention that we link statically.

EDIT 2: I was wrong about static linking. We used to link pretty much everything statically, but we no longer link statically with system libraries. Thanks to those who pointed this out.

echawkes
  • Do you link statically? – Florian Weimer Oct 02 '18 at 17:28
  • If you're using your build-server properly, you have a set of executables that you can compare by their hashes (stripping the executable will produce the same result whether it's stripped for delivery or stripped after the fact - and give a comparable hash). – Thomas Dickey Oct 02 '18 at 20:06
  • How about disassembling the function and comparing that to the disassembly of the known-good one? – Tom Tromey Oct 02 '18 at 21:16
  • You can run `ldd` on the compiled and installed binary to see what dynamic libraries it will actually pick up. However the debugger should tell you as well. – eckes Oct 03 '18 at 03:51
  • "forgot to mention that we link statically." -- you are almost certainly mistaken about that. – Employed Russian Oct 04 '18 at 02:21

2 Answers


The changes in bug 12926 are merely a diagnostic aid. If you need them, you have a file descriptor race in your application. The race may be easier to find as a result, but that is not clear. Either way, application bugs related to file descriptor races will need independent fixes.

There was a bug in glibc itself which could trigger incorrect file descriptor reuse, bug 15946. This fix is far more important than the changes in bug 12926. Bug 15946 could materialize in many different ways, and a hang as in bug 12926 is one possibility.

Note that the change for bug 15946 affects libresolv, which is linked dynamically by default, even if the application is otherwise statically linked. Unless you override the build settings for glibc and link libresolv statically as well or arrange the search paths such that a copy of libresolv which you shipped is picked up, the system glibc will still have to be fixed.

You can try to look at /proc/PID/fd or lsof -p output once the next hang happens. Sometimes the file or socket behind the file descriptor gives you a clue as to where it is coming from, which helps pinpoint the incorrect file descriptor reuse inside the application.
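If lsof is not installed on the device, the same information can be read straight from /proc. The following is a minimal sketch, assuming a Python interpreter is available on the device and that you have permission to read the target process's /proc entries. The function name `list_open_fds` is made up for illustration.

```python
import os

def list_open_fds(pid):
    """Return {fd_number: target} by reading the symlinks in /proc/<pid>/fd."""
    fd_dir = "/proc/{}/fd".format(pid)
    result = {}
    for name in os.listdir(fd_dir):
        try:
            # Each entry is a symlink such as "socket:[12345]",
            # "pipe:[678]" or a regular file path.
            result[int(name)] = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            # The descriptor was closed between listdir() and readlink().
            continue
    return result
```

Calling `list_open_fds(1456)` with the PID of the hung process and printing the sorted items shows what each descriptor refers to. A descriptor that make_request() is reading from but that is not a netlink socket would be consistent with the reuse described above.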

Florian Weimer

> The maintainer of our crosstool swears that the one we are using has a patched glibc.

Unless you are linking statically (which, judging by the 0x7f1fa276c17d address in your stack trace, you are not), the version of GLIBC in your crosstool likely doesn't matter.
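Addresses in the 0x7f... range are where the x86_64 Linux kernel typically places mmap'd shared libraries, whereas the code of a statically linked, non-PIE executable usually sits near 0x400000. To see which file on the device actually backs such an address, you can look it up in /proc/PID/maps of the hung process. Below is a rough sketch of that lookup. The `SAMPLE_MAPS` text, its paths, and its address ranges are invented for illustration and are not taken from the real device.

```python
def find_mapping(maps_text, address):
    """Return the pathname of the mapping that contains address.

    maps_text is the content of /proc/<pid>/maps. Returns "" for an
    anonymous mapping and None if no mapping contains the address.
    """
    for line in maps_text.splitlines():
        # Fields: range, perms, offset, dev, inode, [pathname]
        fields = line.split(None, 5)
        if len(fields) < 5:
            continue
        start_s, end_s = fields[0].split("-")
        start, end = int(start_s, 16), int(end_s, 16)
        if start <= address < end:
            return fields[5].strip() if len(fields) > 5 else ""
    return None

# Hypothetical maps content; real output comes from /proc/<pid>/maps.
SAMPLE_MAPS = """\
00400000-004f0000 r-xp 00000000 08:01 131 /usr/bin/app
7f1fa2690000-7f1fa2830000 r-xp 00000000 08:01 262 /lib/libc-2.19.so
7ffd1c000000-7ffd1c021000 rw-p 00000000 00:00 0 [stack]
"""

# Address of the getaddrinfo() frame from the stack trace in the question.
print(find_mapping(SAMPLE_MAPS, 0x7f1fa2759501))
```

On the device, you would pass the text read from /proc/PID/maps instead of the sample. The path it reports is the libc that the process has really loaded, which is the file to check for the patch.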

> However, there are other failure possibilities, like our builds may be picking up a different glibc for some reason.

Normally you would be picking up GLIBC from the system, and if that GLIBC is not similarly patched, then it is expected that you would still have the bug. That's how dynamic linking works.
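One way to check whether the system GLIBC on a device matches the patched one from your crosstool is to copy the device's libc file off and compare it byte for byte with the crosstool's copy. Here is a minimal sketch using SHA-256. The helper names are made up, and it assumes both copies are in the same state (both stripped or both unstripped), since stripping changes the file contents.

```python
import hashlib

def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def same_file_contents(path_a, path_b):
    """True if the two files have identical contents."""
    return sha256_of(path_a) == sha256_of(path_b)
```

A mismatch only tells you the files differ, not why, so it is a first check before comparing disassembly of the functions involved.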

It is possible to use your own GLIBC, installed in parallel with the system one. However, this is not entirely trivial.

Employed Russian