I recently asked a question that was downvoted because people found it unclear. Since then, I have found a hint that needs some digging.
There is a command-line program called fluent. On Rocks, when I run it on the front-end and enter exit, it returns to the command prompt as expected.
5991 nodes, binary.
5991 node flags, binary.
Done.
> exit
mahmood@cluster:~$
However, when I run the same command on a compute node via ssh (the application is on /export/, which is an NFS drive), it doesn't return to the command prompt.
5991 nodes, binary.
5991 node flags, binary.
Done.
> exit
^C^C^Z
[1]+ Stopped /share/apps/fluent/bin/fluent 3d -g -t4 -i elbow.journal
mahmood@compute-0-3:~$ pkill fluent*
mahmood@compute-0-3:~$ fg
/share/apps/fluent/bin/fluent 3d -g -t4 -i elbow.journal
Terminated
As suggested, I tried strace and attached it to the process several times, since the application runs on multiple cores. In one attempt, the application did return to the terminal. I noticed that in the last lines of the strace output, the result of the futex call differs between the two cases.
In the correct execution, I see:
socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 12
setsockopt(12, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
setsockopt(12, SOL_TCP, TCP_NODELAY, [1], 4) = 0
setsockopt(12, SOL_SOCKET, SO_SNDBUF, [65536], 4) = 0
setsockopt(12, SOL_SOCKET, SO_RCVBUF, [65536], 4) = 0
fcntl(12, F_GETFL) = 0x2 (flags O_RDWR)
fcntl(12, F_SETFL, O_RDWR) = 0
connect(12, {sa_family=AF_INET, sin_port=htons(45470), sin_addr=inet_addr("10.10.10.251")}, 16) = 0
write(12, "12345\0", 6) = 6
write(12, "15 NORMAL_EXITING\0", 19) = 19
read(12, "\0", 1) = 1
close(12) = 0
futex(0x2b66afe5d9d0, FUTEX_WAIT, 12432, NULL) = 0
futex(0x2b66afc5c9d0, FUTEX_WAIT, 12427, NULL) = 0
close(6) = 0
close(7) = 0
close(8) = 0
close(9) = 0
close(10) = 0
shmdt(0x2b66af7d8000) = 0
shmdt(0x2b66b0018000) = 0
shmdt(0x2b66af3a8000) = 0
shmdt(0x2b66af638000) = 0
shmdt(0x2b66af758000) = 0
shmdt(0x2b66aff78000) = 0
shmdt(0x2b66af6d8000) = 0
shmdt(0x2b66afed8000) = 0
close(4) = 0
close(5) = 0
exit_group(0) = ?
Process 12420 detached
and in the buggy run, I see:
socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 9
setsockopt(9, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
setsockopt(9, SOL_TCP, TCP_NODELAY, [1], 4) = 0
setsockopt(9, SOL_SOCKET, SO_SNDBUF, [65536], 4) = 0
setsockopt(9, SOL_SOCKET, SO_RCVBUF, [65536], 4) = 0
fcntl(9, F_GETFL) = 0x2 (flags O_RDWR)
fcntl(9, F_SETFL, O_RDWR) = 0
connect(9, {sa_family=AF_INET, sin_port=htons(50825), sin_addr=inet_addr("10.10.10.251")}, 16) = 0
write(9, "12345\0", 6) = 6
write(9, "15 NORMAL_EXITING\0", 19) = 19
read(9, "\0", 1) = 1
close(9) = 0
futex(0x2b74f03659d0, FUTEX_WAIT, 13135, NULL) = -1 EAGAIN (Resource temporarily unavailable)
futex(0x2b74f01649d0, FUTEX_WAIT, 13132, NULL) = 0
close(6) = 0
close(7) = 0
shmdt(0x2b74efce0000) = 0
shmdt(0x2b74f03e0000) = 0
shmdt(0x2b74efbe0000) = 0
shmdt(0x2b74f0480000) = 0
shmdt(0x2b74ef8b0000) = 0
shmdt(0x2b74efb40000) = 0
shmdt(0x2b74efc60000) = 0
shmdt(0x2b74f0520000) = 0
close(4) = 0
close(5) = 0
exit_group(0) = ?
Process 13129 detached
As you can see, although both traces end with exit_group(0), in the latter the first futex call returns -1 EAGAIN (Resource temporarily unavailable).
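To compare the two traces less by eye, I put together a small script that lists every call returning -1. It is only a minimal sketch: it assumes the strace line format shown above, the helper name failed_calls is my own, and the sample text is a shortened copy of the lines from my traces.

```python
import re

# Matches strace lines of the form: name(args) = ret [extra text]
LINE_RE = re.compile(
    r"^(?P<name>\w+)\((?P<args>.*)\)\s+=\s+(?P<ret>-?\w+|\?)(?:\s+(?P<rest>.*))?$"
)

def failed_calls(trace_text):
    """Return (syscall, errno_name) for every call that returned -1."""
    failures = []
    for line in trace_text.splitlines():
        m = LINE_RE.match(line.strip())
        if m and m.group("ret") == "-1":
            rest = m.group("rest") or ""
            # The first word after -1 is the errno name, e.g. EAGAIN
            errno_name = rest.split()[0] if rest else "UNKNOWN"
            failures.append((m.group("name"), errno_name))
    return failures

# Shortened excerpts of the two traces above
good = """\
close(12) = 0
futex(0x2b66afe5d9d0, FUTEX_WAIT, 12432, NULL) = 0
futex(0x2b66afc5c9d0, FUTEX_WAIT, 12427, NULL) = 0
exit_group(0) = ?
"""
buggy = """\
close(9) = 0
futex(0x2b74f03659d0, FUTEX_WAIT, 13135, NULL) = -1 EAGAIN (Resource temporarily unavailable)
futex(0x2b74f01649d0, FUTEX_WAIT, 13132, NULL) = 0
exit_group(0) = ?
"""

print(failed_calls(good))
print(failed_calls(buggy))
```

On these excerpts, the only call it flags is the futex in the second trace. It only looks at return values, so it would not show other differences, such as the number of close calls.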
Any thoughts on that?