
I have a network program written in C using TCP sockets. Sometimes the client program hangs forever expecting input from server. Specifically, the client hangs on select() call set on an fd intended to read characters sent by server.

I am using strace to find out where the process got stuck. However, sometimes when I attach strace to the hung client process, it immediately resumes its execution and exits properly. Not all hung processes exhibit this behavior: some remain stuck in select() even after I attach strace to them. But most of the processes resume their execution when strace is attached.

I am curious what causes the processes to resume when strace is attached. It might give me clues as to why the client processes hang.

Any ideas? What causes a hung process to resume its execution when strace is attached to it?

Update:

Here's the output of strace on hung processes.

> sudo strace -p 25645
Process 25645 attached - interrupt to quit
--- SIGSTOP (Stopped (signal)) @ 0 (0) ---
--- SIGSTOP (Stopped (signal)) @ 0 (0) ---
[ Process PID=25645 runs in 32 bit mode. ]
select(6, [3 5], NULL, NULL, NULL)      = 2 (in [3 5])
read(5, "\0", 8192)                     = 1
write(2, "", 0)                         = 0
read(3, "====Setup set_oldtempbehaio"..., 8192) = 555
write(1, "====Setup set_oldtempbehaio"..., 555) = 555
select(6, [3 5], NULL, NULL, NULL)      = 2 (in [3 5])
read(5, "", 8192)                       = 0
read(3, "", 8192)                       = 0
close(5)                                = 0
kill(25652, SIGKILL)                    = 0
exit_group(0)                           = ?
Process 25645 detached

_

> sudo strace -p 14462
Process 14462 attached - interrupt to quit
[ Process PID=14462 runs in 32 bit mode. ]
read(0, 0xff85fdbc, 8192)               = -1 EIO (Input/output error)
shutdown(3, 1 /* send */)               = 0
exit_group(0)                           = ?

_

> sudo strace -p 7517
Process 7517 attached - interrupt to quit
--- SIGSTOP (Stopped (signal)) @ 0 (0) ---
--- SIGSTOP (Stopped (signal)) @ 0 (0) ---
[ Process PID=7517 runs in 32 bit mode. ]
connect(3, {sa_family=AF_INET, sin_port=htons(300), sin_addr=inet_addr("100.64.220.98")}, 16) = -1 ETIMEDOUT (Connection timed out)
close(3)                                = 0
dup(2)                                  = 3
fcntl64(3, F_GETFL)                     = 0x1 (flags O_WRONLY)
close(3)                                = 0
write(2, "dsd13: Connection timed out\n", 30) = 30
write(2, "Error code : 110\n", 17)      = 17
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
exit_group(1)                           = ?
Process 7517 detached

It is not just select(): processes of the same program are stuck in various system calls before I attach strace to them. They suddenly resume once strace is attached. If I don't attach strace, they just hang there forever.

Update 2:

I learned that strace can resume a process that was previously stopped (a process in the 'T' state). Now I am trying to understand why these processes went into the 'T' state in the first place. Here's the /proc//status information:

> cat /proc/12554/status
Name:   someone
State:  T (stopped)
SleepAVG:       88%
Tgid:   12554
Pid:    12554
PPid:   9754
TracerPid:      0
Uid:    5000    5000    5000    5000
Gid:    48986   48986   48986   48986
FDSize: 256
Groups: 9149 48986
VmPeak:     1992 kB
VmSize:     1964 kB
VmLck:         0 kB
VmHWM:       608 kB
VmRSS:       608 kB
VmData:      156 kB
VmStk:        20 kB
VmExe:        16 kB
VmLib:      1744 kB
VmPTE:        20 kB
Threads:        1
SigQ:   54/73728
SigPnd: 0000000000000000
ShdPnd: 0000000000000000
SigBlk: 0000000000000000
SigIgn: 0000000000000006
SigCgt: 0000000000004000
CapInh: 0000000000000000
CapPrm: 0000000000000000
CapEff: 0000000000000000
Cpus_allowed:   00000000,00000000,00000000,0000000f
Mems_allowed:   00000000,00000001
ernesto
  • It just looks like a process in the T state will start executing when strace is attached to it. If you quit the strace session during execution, the process is moved back to the T state. – ernesto Feb 26 '14 at 06:22

1 Answer

strace uses ptrace. The ptrace man page has this:

Since attaching sends SIGSTOP and the tracer usually suppresses it,
this may cause a stray EINTR return from the currently executing system
call in the tracee, as described in the "Signal injection and
suppression" section.

Are you seeing select return EINTR?

Mark Plotnick
  • No, select() properly returns. I updated my question with the strace output. – ernesto Feb 20 '14 at 08:04
  • I've seen some other processes stuck at different places (connect(), read(), etc.). Surprisingly, all these processes resume execution once I attach strace to them. The only thing they have in common is that they have been running (actually hung) for a month. Would the Linux kernel set aside processes that are not active (no I/O) and never resume them? – ernesto Feb 20 '14 at 10:56
  • I tried increasing priority with NICE. But NICE has no effect on them, only strace is somehow doing the magic. – ernesto Feb 20 '14 at 10:57
  • I looked at the process states of all these hung processes. Some of them are in the S+ state and some are in the T state. The ones that resume execution when strace is attached are the ones in the 'T' state. `kill -SIGCONT ` also restarts those hung processes in the 'T' state. However, I am unclear about what sent them into the 'T' state. Would the kernel put processes into the 'T' state after they have waited a long time, without anyone explicitly sending a `SIGSTOP` signal? – ernesto Feb 24 '14 at 11:02
  • 'T' state means stopped. It can be caused by another process tracing it (in which case `/proc/pid/status` may show a nonzero `TracerPid`) or because it was sent a SIGTSTP (by typing control-Z), or sent a SIGSTOP, or got a SIGTTOU or SIGTTIN if it tried to do output or input on the terminal when it was in the background. I don't think 'T' state occurs just because a process has been waiting. Does your `select` involve only network sockets, or does it involve terminal devices, too? – Mark Plotnick Feb 24 '14 at 16:03
  • No, the select waits only to read from two network sockets. I updated my post with the proc/status information. From the status information, it appears that the only signal the process received is `SIGTERM` (`SigCgt: 0000000000004000`). I am not sure how it went into the `T` state then. I might be wrong in my interpretation. – ernesto Feb 27 '14 at 06:30
  • @ernesto Rereading this, I think my answer probably should have been a comment, especially since it didn't fix your problem. Could you please uncheck it, and I'll delete it? – Mark Plotnick Aug 23 '19 at 21:00