
I recently asked a question that was downvoted because people found it unclear. Since then I have found a hint that needs some digging.

There is a command-line program called fluent. On Rocks, when I run it on the front-end and enter exit, it returns to the command prompt:

    5991 nodes, binary.
    5991 node flags, binary.
    Done.

    > exit
    mahmood@cluster:~$

However, when I run the same command on a compute node via ssh (the application is on /export/, which is an NFS drive), it does not return to the command prompt:

    5991 nodes, binary.
    5991 node flags, binary.
    Done.

    > exit
    ^C^C^Z
    [1]+  Stopped                 /share/apps/fluent/bin/fluent 3d -g -t4 -i elbow.journal
    mahmood@compute-0-3:~$ pkill fluent*
    mahmood@compute-0-3:~$ fg
    /share/apps/fluent/bin/fluent 3d -g -t4 -i elbow.journal
    Terminated

As suggested, I tried strace and attached it to the process several times, since the application runs on multiple cores. In one attempt, the application did return to the terminal. Comparing the last lines of the strace output, I noticed a difference in the result of the futex call.
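
Attaching to one task at a time is easy to get wrong when several are running. The following is a minimal sketch, assuming a Linux `/proc` filesystem, of how the thread IDs of a process could be listed and a single strace invocation built that follows all of them. The helper names `list_thread_ids` and `strace_command` and the output prefix are hypothetical. Whether fluent uses threads or separate processes for `-t4` is not known here, so the PID of each fluent process may have to be handled separately.

```python
import os


def list_thread_ids(pid):
    """Return the thread IDs of `pid` as listed in /proc/<pid>/task (Linux only)."""
    task_dir = f"/proc/{pid}/task"
    if not os.path.isdir(task_dir):
        # No /proc entry: not Linux, or the process does not exist.
        return []
    return sorted(int(name) for name in os.listdir(task_dir) if name.isdigit())


def strace_command(pid, out_prefix="trace"):
    """Build an strace command line that attaches to a running process.

    -f   follow threads and child processes
    -ff  together with -o, write one output file per traced task
    -o   output file prefix
    -p   attach to an already running process
    """
    return ["strace", "-f", "-ff", "-o", out_prefix, "-p", str(pid)]


if __name__ == "__main__":
    pid = os.getpid()  # stand-in; in practice this would be a fluent PID
    print("threads:", list_thread_ids(pid))
    print("command:", " ".join(strace_command(pid, "fluent_trace")))
```

The resulting command is only printed, not executed, so it can be checked before it is run against the real process.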

In the correct execution, I see:

    socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 12
    setsockopt(12, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
    setsockopt(12, SOL_TCP, TCP_NODELAY, [1], 4) = 0
    setsockopt(12, SOL_SOCKET, SO_SNDBUF, [65536], 4) = 0
    setsockopt(12, SOL_SOCKET, SO_RCVBUF, [65536], 4) = 0
    fcntl(12, F_GETFL)                      = 0x2 (flags O_RDWR)
    fcntl(12, F_SETFL, O_RDWR)              = 0
    connect(12, {sa_family=AF_INET, sin_port=htons(45470), sin_addr=inet_addr("10.10.10.251")}, 16) = 0
    write(12, "12345\0", 6)                 = 6
    write(12, "15  NORMAL_EXITING\0", 19)   = 19
    read(12, "\0", 1)                       = 1
    close(12)                               = 0
    futex(0x2b66afe5d9d0, FUTEX_WAIT, 12432, NULL) = 0
    futex(0x2b66afc5c9d0, FUTEX_WAIT, 12427, NULL) = 0
    close(6)                                = 0
    close(7)                                = 0
    close(8)                                = 0
    close(9)                                = 0
    close(10)                               = 0
    shmdt(0x2b66af7d8000)                   = 0
    shmdt(0x2b66b0018000)                   = 0
    shmdt(0x2b66af3a8000)                   = 0
    shmdt(0x2b66af638000)                   = 0
    shmdt(0x2b66af758000)                   = 0
    shmdt(0x2b66aff78000)                   = 0
    shmdt(0x2b66af6d8000)                   = 0
    shmdt(0x2b66afed8000)                   = 0
    close(4)                                = 0
    close(5)                                = 0
    exit_group(0)                           = ?
    Process 12420 detached

and in the buggy run, I see:

    socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 9
    setsockopt(9, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
    setsockopt(9, SOL_TCP, TCP_NODELAY, [1], 4) = 0
    setsockopt(9, SOL_SOCKET, SO_SNDBUF, [65536], 4) = 0
    setsockopt(9, SOL_SOCKET, SO_RCVBUF, [65536], 4) = 0
    fcntl(9, F_GETFL)                       = 0x2 (flags O_RDWR)
    fcntl(9, F_SETFL, O_RDWR)               = 0
    connect(9, {sa_family=AF_INET, sin_port=htons(50825), sin_addr=inet_addr("10.10.10.251")}, 16) = 0
    write(9, "12345\0", 6)                  = 6
    write(9, "15  NORMAL_EXITING\0", 19)    = 19
    read(9, "\0", 1)                        = 1
    close(9)                                = 0
    futex(0x2b74f03659d0, FUTEX_WAIT, 13135, NULL) = -1 EAGAIN (Resource temporarily unavailable)
    futex(0x2b74f01649d0, FUTEX_WAIT, 13132, NULL) = 0
    close(6)                                = 0
    close(7)                                = 0
    shmdt(0x2b74efce0000)                   = 0
    shmdt(0x2b74f03e0000)                   = 0
    shmdt(0x2b74efbe0000)                   = 0
    shmdt(0x2b74f0480000)                   = 0
    shmdt(0x2b74ef8b0000)                   = 0
    shmdt(0x2b74efb40000)                   = 0
    shmdt(0x2b74efc60000)                   = 0
    shmdt(0x2b74f0520000)                   = 0
    close(4)                                = 0
    close(5)                                = 0
    exit_group(0)                           = ?
    Process 13129 detached

As you can see, although both runs end with exit_group(0), in the buggy run the first futex call returns -1 EAGAIN (Resource temporarily unavailable).
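
To compare the two traces systematically rather than by eye, a small parser can help. This is a minimal sketch, assuming strace's default `name(args) = result` line format; the function names are hypothetical and the sample text is an excerpt of the two traces above with the `shmdt` lines left out. As a side note, the futex(2) manual page documents `EAGAIN` from `FUTEX_WAIT` as meaning that the value at the address no longer matched the expected value when the call was made, so that line on its own does not necessarily indicate a fault.

```python
import re

# Matches lines such as:
#   close(12)                               = 0
#   futex(0x..., FUTEX_WAIT, 13135, NULL) = -1 EAGAIN (Resource temporarily unavailable)
LINE_RE = re.compile(r"^\s*(\w+)\((.*)\)\s+=\s+(\S+)(?:\s+(E[A-Z0-9]+))?")


def parse_strace(text):
    """Return a list of (syscall, args, result, errno_name) tuples."""
    calls = []
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:
            calls.append(match.groups())
    return calls


def failed_calls(calls):
    """Return (syscall, errno_name) for every call that reported an errno."""
    return [(name, errno) for name, _args, _result, errno in calls if errno]


def closed_fds(calls):
    """Return the file descriptors passed to close(), in call order."""
    return [int(args) for name, args, _result, _errno in calls if name == "close"]


GOOD = r"""
close(12)                               = 0
futex(0x2b66afe5d9d0, FUTEX_WAIT, 12432, NULL) = 0
futex(0x2b66afc5c9d0, FUTEX_WAIT, 12427, NULL) = 0
close(6)                                = 0
close(7)                                = 0
close(8)                                = 0
close(9)                                = 0
close(10)                               = 0
close(4)                                = 0
close(5)                                = 0
exit_group(0)                           = ?
"""

BAD = r"""
close(9)                                = 0
futex(0x2b74f03659d0, FUTEX_WAIT, 13135, NULL) = -1 EAGAIN (Resource temporarily unavailable)
futex(0x2b74f01649d0, FUTEX_WAIT, 13132, NULL) = 0
close(6)                                = 0
close(7)                                = 0
close(4)                                = 0
close(5)                                = 0
exit_group(0)                           = ?
"""

if __name__ == "__main__":
    for label, text in (("good", GOOD), ("bad", BAD)):
        calls = parse_strace(text)
        print(label, "failed:", failed_calls(calls), "closed:", closed_fds(calls))
```

On these excerpts the sketch reports one failed call in the buggy run (the futex `EAGAIN`) and shows that the buggy run closes five descriptors where the good run closes eight.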

Any thoughts on that?

  • It looks like the second example hasn't closed all the sockets and might be holding a lock on the process (notice the discrepancy in `close(n)`). Maybe try running the process in a `screen` session... – l'L'l Jul 09 '16 at 21:49
  • I found this topic, http://stackoverflow.com/questions/14370489, which is related to mine. The answer mentions a timeout parameter, but I don't know how to set it. Is it possible to change the kernel parameters for that? I don't have access to the source code of the binary. – mahmood Jul 10 '16 at 11:15
