
In my program, one thread has to monitor the network interfaces continuously, so it calls getifaddrs() in a while loop.

    while (true) {
        struct ifaddrs *ifaddr, *ifa;
        if (getifaddrs(&ifaddr) == -1) {
            perror("getifaddrs couldn't fetch required data");
            exit(EXIT_FAILURE);
        }

        // Iterate through the linked list of interfaces
        for (ifa = ifaddr; ifa != NULL; ifa = ifa->ifa_next) {
            // monitoring logic
        }

        // Free the linked list
        freeifaddrs(ifaddr);

        // Sleep for the specified time before the next polling cycle
        usleep(1000);
    }

Most of the time my program works fine. However, sometimes getifaddrs() returns -1 with errno set to EBADF (bad file descriptor). To keep the thread alive, I have temporarily replaced exit with continue, as I don't want my program to end because of this. Still, I'd like to know in which cases getifaddrs() can return a 'bad file descriptor' error, and whether I can do something to prevent it.

EDIT

Replacing 'exit' with 'continue' didn't solve my problem. Sometimes the call to getifaddrs() crashes the application!

Given below is the backtrace obtained from gdb using the generated core file.

Program terminated with signal 6, Aborted.
#0  0x00007fe2df1ef387 in raise () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install glibc-2.17-307.el7.1.x86_64 keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.15.1-37.el7_6.x86_64 libcom_err-1.42.9-16.el7.x86_64 libgcc-4.8.5-39.el7.x86_64 libselinux-2.5-14.1.el7.x86_64 libstdc++-4.8.5-39.el7.x86_64 openssl-libs-1.0.2k-19.el7.x86_64 pcre-8.32-17.el7.x86_64 zlib-1.2.7-18.el7.x86_64
(gdb) bt
#0  0x00007fe2df1ef387 in raise () from /lib64/libc.so.6
#1  0x00007fe2df1f0a78 in abort () from /lib64/libc.so.6
#2  0x00007fe2df231ed7 in __libc_message () from /lib64/libc.so.6
#3  0x00007fe2df231fbe in __libc_fatal () from /lib64/libc.so.6
#4  0x00007fe2df2df4c2 in __netlink_assert_response () from /lib64/libc.so.6
#5  0x00007fe2df2dc412 in __netlink_request () from /lib64/libc.so.6
#6  0x00007fe2df2dc5ef in getifaddrs_internal () from /lib64/libc.so.6
#7  0x00007fe2df2dd310 in getifaddrs () from /lib64/libc.so.6
#8  0x000000000047c03c in __interceptor_getifaddrs.part.0 ()

Operating system: Red Hat Enterprise Linux Server release 7.8 (Maipo)

GLIBC version: 2.17

Vishal Sharma
  • Citing the manpage https://man7.org/linux/man-pages/man3/getifaddrs.3.html: *getifaddrs() may fail and set errno for any of the errors specified for socket(2), bind(2), getsockname(2), recvmsg(2), sendto(2), malloc(3), or realloc(3).* Some of these functions specify `EBADF` as a possible `errno` value. You can try to reproduce the error with a system call trace (`strace`). This should show which system call failed and might help to analyze the cause of the problem. – Bodo Dec 06 '21 at 08:59
  • Or maybe you mess up `ifaddr` content in your "monitoring logic"? – Matthieu Dec 09 '21 at 18:46
  • Check this link; it may help you identify why it crashes. https://patchwork.ozlabs.org/project/netdev/patch/5638B93F.3090202@redhat.com/ – idris Dec 10 '21 at 15:10
  • So ... you set up a bounty to "draw more attention to this question" then simply ignore all answers and attempts to help? @RainerKeller provided a very interesting solution and I'm curious to know whether it helped you. – Matthieu Dec 14 '21 at 23:59
  • No doubt @RainerKeller's answer has taken our investigation forward..but he himself mentioned that he doesn't really answer the core question...and that's why I hadn't rewarded the bounty yet, hoping that some more responses might come. – Vishal Sharma Dec 15 '21 at 05:55

4 Answers


The following example from the man page, amended to include your busy loop with usleep(), ran for minutes both natively and under valgrind without throwing an error. Admittedly, my server had no network interfaces failing or coming up while the example was running.

I tested on CentOS 7.9 which has glibc-2.17-323.el7_9.x86_64.

#include <arpa/inet.h>
#include <sys/socket.h>
#include <netdb.h>
#include <ifaddrs.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char *argv[]) {
    struct ifaddrs *ifaddr, *ifa;
    int family, s;
    char host[NI_MAXHOST];

    while (1) {
        if (getifaddrs(&ifaddr) == -1) {
            perror("getifaddrs");
            exit(EXIT_FAILURE);
        }

        /* Walk through linked list, maintaining head pointer so we
          can free list later */

        for (ifa = ifaddr; ifa != NULL; ifa = ifa->ifa_next) {
            if (ifa->ifa_addr == NULL)
                continue;
            family = ifa->ifa_addr->sa_family;
            /* Display interface name and family (including symbolic
               form of the latter for the common families) */
            // Commented out
        }
        freeifaddrs(ifaddr);
        usleep(1000);
    }
    exit(EXIT_SUCCESS);
}

What's interesting, though: upstream GNU glibc-2.17 does not contain the assertion __netlink_assert_response, but upstream glibc-2.31 does. So this is something that Red Hat patched into its 2.17 package later. You can retrace my steps with:

SRC=`basename $(rpm -q glibc) .x86_64`.src.rpm
wget --no-check-certificate http://vault.centos.org/7.9.2009/updates/Source/SPackages/${SRC}
CPIO=`basename ${SRC} .rpm`.cpio
rpm2cpio ${SRC} > ${CPIO}
mkdir glibc-src && cd glibc-src
cpio -ivd < ${CPIO}

This shows that the assertion failing in your case was added by the patch glibc-rh1443872.patch, which states:

commit 2eecc8afd02d8c65cf098cbae4de87f332dc21bd

Author: ...

Date: Mon Nov 9 12:48:41 2015 +0100

Terminate process on invalid netlink response from kernel [BZ #12926]

The Bugzilla entry https://sourceware.org/bugzilla/show_bug.cgi?id=12926 gives details on the netlink interface being lossy.

Now, none of that answers your actual question: why does getifaddrs() fail, and why does glibc kill your process with SIGABRT?

Following @Matthieu's comment, let's assume you don't corrupt your stack and/or the pointer ifaddr in your monitoring logic. Then this could still be a communication error between the kernel and glibc, which would require further investigation. As a workaround, you might temporarily catch the abort signal, as described in "How to Handle SIGABRT signal?".

EDIT: Of course, if You special case for EBADF, You nevertheless have to freeifaddrs(ifaddr) prior to continuing...

Rainer Keller

https://patchwork.ozlabs.org/project/netdev/patch/5638B93F.3090202@redhat.com/

The link gives the reason for the crash: "The recvmsg system calls for netlink sockets have been particularly prone to picking up unrelated data after a file descriptor race (where the descriptor is closed and reopened concurrently in a multi-threaded process, as the result of a file descriptor management issue elsewhere)."

So I think you either need to avoid using a separate thread or use some locking mechanism around the netlink functions.

At the very least, check whether it still crashes when you monitor the network interfaces in the main thread.
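The race described in that quote depends on descriptor numbers being recycled: once a descriptor is closed, the next open() or socket() call in any thread can receive the same number. A minimal single-threaded sketch of that reuse follows; the helper name fd_number_is_reused is made up for this illustration, and /dev/null is just a convenient file to open.

```c
#include <fcntl.h>
#include <unistd.h>

/* Returns 1 if the descriptor number freed by close() is handed out
   again by the next open(), 0 if not, and -1 on error. */
int fd_number_is_reused(void)
{
    int first = open("/dev/null", O_RDONLY);
    if (first == -1)
        return -1;
    close(first);

    /* POSIX hands out the lowest free descriptor number, so with no
       other thread opening files in between, the number comes back. */
    int second = open("/dev/null", O_RDONLY);
    if (second == -1)
        return -1;

    int reused = (second == first);
    close(second);
    return reused;
}
```

In a multi-threaded process the same reuse means that a stale close() in one thread can hit a socket that another thread, for example the one inside getifaddrs(), has just opened under that number.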

idris
  • When we run this module(interface monitor) in a sample single-threaded program, then we've not been able to reproduce the issue...However, even in our multi-threaded application, there's only a single thread which is calling getifaddrs()...In rest of the other threads no such call to getifaddrs is being made or even to any netlink function(AFAIR, will recheck)...Also I've run the application with address sanitiser and thread sanitiser and till now no related issue has been found. – Vishal Sharma Dec 15 '21 at 07:53
  • So we can confirm that the crash only happens in a multi-threaded environment. Am I wrong? @VishalSharma – idris Dec 15 '21 at 08:03
  • yes..that's what the observation has been till now. – Vishal Sharma Dec 15 '21 at 08:13
  • How about monitoring in a separate process and communicating with your main app via IPC? – idris Dec 15 '21 at 08:16
  • Yeah...I guess that might be a better solution – Vishal Sharma Dec 15 '21 at 09:07

According to the getifaddrs man page on man7.org, any of the underlying socket operations could be the cause of EBADF:

ERRORS

getifaddrs() may fail and set errno for any of the errors specified for socket(2), bind(2), getsockname(2), recvmsg(2), sendto(2), malloc(3), or realloc(3).
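To illustrate how one of those calls ends up with EBADF, here is a minimal sketch that calls getsockname() on a descriptor that has already been closed. The helper name errno_on_closed_socket is made up for this illustration, and an AF_UNIX datagram socket is used only because it needs no network setup; getifaddrs() itself uses a netlink socket internally.

```c
#include <errno.h>
#include <sys/socket.h>
#include <unistd.h>

/* Returns the errno left by getsockname() on a descriptor that was
   already closed, 0 if the call unexpectedly succeeded, or -1 if the
   socket could not be created in the first place. */
int errno_on_closed_socket(void)
{
    int fd = socket(AF_UNIX, SOCK_DGRAM, 0);
    if (fd == -1)
        return -1;
    close(fd);  /* from here on, fd no longer names an open file */

    struct sockaddr_storage addr;
    socklen_t len = sizeof addr;
    errno = 0;
    if (getsockname(fd, (struct sockaddr *)&addr, &len) == 0)
        return 0;
    return errno;
}
```

Inside getifaddrs() the application never sees the descriptor, so an EBADF there suggests that something outside the call closed it while it was in use.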


Unrelated, but do you do freeifaddrs() somewhere?

Olaf Dietsche
  • Yes I do use freeifaddrs(). I've edited the code in the question. – Vishal Sharma Dec 06 '21 at 09:28
  • IMHO this does not answer the question from a user's or application programmer's point of view "*in which cases can getifaddrs() return 'bad file descriptor' error and whether I can do something so that this does not happen*". The list of function calls that may set this `errno` value does not really explain in which situation the error may occur. – Bodo Dec 06 '21 at 09:32

Fortunately I've been able to trace the root cause behind the issue. The scenario is already explained in detail here.

So basically, one thread in my program had a 'double-close' bug, which sometimes caused the issue.
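For anyone hitting the same thing, a minimal sketch of one common guard against double closes: close through a helper that invalidates the stored descriptor, so a second call becomes a no-op instead of closing whatever has reused that number in the meantime (presumably the netlink socket of getifaddrs() in my case). The name close_once is made up for this illustration and is not what my actual code uses.

```c
#include <unistd.h>

/* Closes *fd at most once and marks it invalid afterwards.
   Returns 0 if nothing was open, otherwise the result of close(). */
int close_once(int *fd)
{
    if (*fd < 0)
        return 0;   /* already closed through this helper */

    int rc = close(*fd);
    *fd = -1;       /* a later call can no longer hit a reused number */
    return rc;
}
```

Note that the helper itself is not thread-safe: if several threads share the same descriptor variable, access to it still needs a lock.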

Vishal Sharma