We are seeing suck sendmail processes when we are attempting to send email from a Perl FCGI process. These processes are taking too long, hours to a day, since it should just be doing a relay to a server configured in sendmail as the smart host. Most of the mail from the FCGI processes takes less than 5 seconds. The slow sendmail processes are easy to find on the our servers with $ ps -ef | grep sendmail
Almost all of the email works normally from these web nodes. I'd guess thousands of mails go through with no problem. Sending test email from the command line goes smoothly. The sendmail command gets stuck rarely and we don't have a way to reproduce it.
It seems that most of this stuck email gets through sooner or later. These seem to be sending mail hours later, sometimes over a day later.
All of the sendmail that we've seen stuck has been a command that was run by a Perl process, which is a child process of a FCGI process.
Looking at the logs of the smart host we see that most of this mail does get through sooner or later but we have found some that don't seem to have ever been sent.
This is running in FCGI for Catalyst and then added to a IO::Async::Loop which does some processing, and in the IO::Async::Loop, Email::Sender::Transport::Sendmail is used which does a open($fh, '|-', @args) and pipes the mail header+body and does a close($fh).
I've seen this http://perldoc.perl.org/perlipc.html#Avoiding-Pipe-Deadlocks but don't know how to apply it in this situation. The child sendmail has only STDIN open.
When we have one of these stuck sendmails the sendmail is waiting on STDIN:
[<ffffffff8119ce8b>] pipe_wait+0x5b/0x80
[<ffffffff8119d8ad>] pipe_read+0x34d/0x4d0
[<ffffffff8119204a>] do_sync_read+0xfa/0x140
[<ffffffff81192945>] vfs_read+0xb5/0x1a0
[<ffffffff81192c91>] sys_read+0x51/0xb0
[<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
and the async perl process is waiting on the child to die:
#0 0x00007f8849e6065e in waitpid () from /lib64/libc.so.6
#1 0x000000000046dc2d in Perl_wait4pid ()
#2 0x000000000046de2d in Perl_my_pclose ()
#3 0x00000000004cec4e in Perl_io_close ()
#4 0x00000000004ceda8 in Perl_do_close ()
#5 0x00000000004c2629 in Perl_pp_close ()
#6 0x00000000004804de in Perl_runops_standard ()
#7 0x000000000042e7ad in perl_run ()
#8 0x000000000041bbc5 in main ()
An example of one that didn't get through: Job #1653576 (that's just our internal job number) has a sendmail process that started on Aug 19 13:04.
Process on webnode2:
fcgi-user 13621 13466 0 13:04 ? 00:00:00 /usr/sbin/sendmail -i -f admin@ourServer.org -- proffunnyhat@mit.edu
I don't see the record I expect to see on our smart host for this in /var/log/maillog that would indicate that it was relayed to nexus and then to MIT.
I do see successful email for proffunnyhat@mit.edu on Aug 21 (from web2 /var/log/maillog):
Aug 21 00:00:02 node-008 sendmail[13621]: u7JH4tbr013621: to=proffunnyhat@mit.edu, ctladdr=admin@ourServer.org (10520/10520), delay=1+10:55:07, xdelay=00:00:01, mailer=relay, pri=32292, relay=[127.0.0.1] [127.0.0.1], dsn=2.0.0, stat=Sent (u7L401Z1026237 Message accepted for delivery)
Aug 21 00:00:02 node-008 sendmail[26247]: u7L401Z1026237: to=<proffunnyhat@mit.edu>, delay=00:00:01, xdelay=00:00:00, mailer=relay, pri=122657, relay=mail.ourServer.org. [128.84.4.11], dsn=2.0.0, stat=Sent (u7L402jx001185 Message accepted for delivery)
and then on mail.ourServer.org:
bdc34 @mail.ourServer.org: log$ sudo grep u7L402jx001185 maillog*
maillog-20160821:Aug 21 00:00:02 web2 sendmail[1185]: u7L402jx001185: from=<admin@ourServer.org>, size=2874, class=0, nrcpts=1, msgid=<201608191704.u7JH4tbr013621@mail.ourServer.org>, proto=ESMTP, daemon=MTA, relay=mail.ourServer.org [128.84.4.13]
maillog-20160821:Aug 21 00:00:03 mail.ourServer.org[1200]: u7L402jx001185: to=<proffunnyhat@mit.edu>, ctladdr=<e-admin@ourServer.org> (10519/10519), delay=00:00:01, xdelay=00:00:01, mailer=esmtp, pri=122874, relay=dmz-mailsec-scanner-8.mit.edu. [18.7.68.37], dsn=2.0.0, stat=Sent (OK 5E/2D-20045-34729B75)
An example of one that was stuck but seems to have been sent: mail.ourServer.org:/var/log/sendmail:
Aug 19 02:19:51 mail.ourServer.org sendmail[20792]: u7J6JlP6020790: to=<jxjx@connect.ust.hk>, ctladdr=<admin@ourServer.org> (10519/10519), delay=00:00:04, xdelay=00:00:04, mailer=esmtp, pri=122504, relay=connect-ust-hk\
.mai...ction.outlook.com. [213.199.154.87], dsn=2.0.0, stat=Sent (<201608190619.u7J6Jlda000738@web2.ourServer.org> [InternalId=15526306777069,...1MB1197.apcprd01.prod.exchangelabs.com] 9137 bytes in 0.189, 47.082 KB/sec\
Queued mail for delivery)
Things we have tried
I've modified Email::Sender::Transport::Sendmail to send a '\x00' to the pipe, that didn't work.
I've replaced IO::Async::Loop::Poll with IO::Async::Loop::Select. That didn't change anything.
I've tried sending signals to the sendmail and its parent. That killed them but the mail was aborted.
Added our fcgi user to sendmail's trusted users file. Didn't change anything.
I wrote a wrapper script that read from STDIN and writes to sendmail. If nothing comes in on STDIN for 5 seconds it exits. This feels really hacky to me but it does seem to work. Since mail is a critical part of our system I'd rather have a real solution.