Process stuck within SQS calls

Question

I have a Python script that checks SQS for messages in a loop and then exits. A cron job restarts the script every few minutes if it is not found running.

# start
def main():
    for i in 1..100:
        # Establishes connections to SQS.
        # Long polling is not used; Receive Message Wait Time is set to 0.
        check SQS for a new message
        if a new job is found:
            ProcessIt()
# end

After the script has been running on an EC2 instance for a few days, it hangs and no longer checks SQS for new messages.

When I ran lsof on the pid of the process and grepped for the SQS connections, I found that all connections to SQS were in CLOSE_WAIT. The only fix so far is to kill and restart the script process manually. It seems the cron job never restarts the script because the process is still running, stuck in a call to SQS:

ip-10-x-y-z:~ # lsof -p 9018  | grep "72.21"
ld-linux. 9018 root    7u  IPv4 474699439      0t0       TCP ip-10-x-y-z.ec2.internal:58211->72.21.202.145:https (CLOSE_WAIT)
ld-linux. 9018 root   10u  IPv4 474699560      0t0       TCP ip-10-x-y-z.ec2.internal:53428->72.21.194.47:https (CLOSE_WAIT)
ld-linux. 9018 root   12u  IPv4 474701017      0t0       TCP ip-10-x-y-z.ec2.internal:52166->72.21.214.70:https (CLOSE_WAIT)
ld-linux. 9018 root   18u  IPv4 474694555      0t0       TCP ip-10-x-y-z.ec2.internal:57267->72.21.202.145:https (CLOSE_WAIT)
ld-linux. 9018 root   22u  IPv4 474694573      0t0       TCP ip-10-x-y-z.ec2.internal:57271->72.21.202.145:https (CLOSE_WAIT)
ld-linux. 9018 root   39u  IPv4 474701031      0t0       TCP ip-10-x-y-z.ec2.internal:52170->72.21.214.70:https (CLOSE_WAIT)

I know I should use long polling, but I am still wondering why the process gets stuck and never recovers on its own. I am using Boto 2.23.

Any input would be helpful.

Is this related to the http_socket_timeout boto config in any way ? — ic10503, Jul 24 '14 at 20:12

score 1 · Accepted Answer · edited May 23 '17 at 11:49

Debugging the stuck process with gdb produced the following traceback:

(gdb) pystack
~/mypackage/lib/python2.6/ssl.py (293): do_handshake
~/mypackage/lib/python2.6/ssl.py (120): __init__
~/mypackage/lib/python2.6/ssl.py (350): wrap_socket
~/mypackage/lib/python2.6/site-packages/boto/https_connection.py (118): connect
~/mypackage/lib/python2.6/httplib.py (725): send
~/mypackage/lib/python2.6/httplib.py (764): _send_output
~/mypackage/lib/python2.6/httplib.py (892): endheaders
~/mypackage/lib/python2.6/httplib.py (937): _send_request
~/mypackage/lib/python2.6/httplib.py (899): request
~/mypackage/lib/python2.6/site-packages/boto/connection.py (902): _mexe
~/mypackage/lib/python2.6/site-packages/boto/connection.py (1063): make_request
~/mypackage/lib/python2.6/site-packages/boto/connection.py (1138): get_object
~/mypackage/lib/python2.6/site-packages/boto/sqs/connection.py (355): get_queue
~/mypackage/lib/python2.6/site-packages/sqs/SQSHelper.py (96): __init__
~/mypackage/sqs/SQSWrapper.py (1229): main
~/mypackage/sqs/SQSWrapper.py (1367): <module>

As the traceback shows, my script is stuck inside the SQS get_queue() call, and more precisely in the SSL handshake (do_handshake) made while connecting.
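To illustrate the failure mode, here is a minimal sketch (written for Python 3, not the 2.6 interpreter from the traceback) of a TLS handshake against a peer that accepts the connection and then never answers. The helper names start_silent_server and handshake_times_out are made up for this example, and the local dummy server is only a stand-in for a stalled remote endpoint. On an interpreter that applies the socket timeout to the handshake, the call gives up after the timeout instead of blocking forever:

```python
import socket
import ssl
import threading


def start_silent_server():
    """Listen on a local port, accept one connection, and never answer.

    Hypothetical helper: it stands in for a peer that stalls mid-handshake.
    """
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(1)
    release = threading.Event()

    def serve():
        conn, _ = listener.accept()
        release.wait()  # hold the connection open without sending anything
        conn.close()
        listener.close()

    threading.Thread(target=serve, daemon=True).start()
    return listener.getsockname()[1], release


def handshake_times_out(port, timeout_seconds):
    """Return True if the TLS handshake was abandoned because of the timeout."""
    context = ssl.create_default_context()
    # Demo only: the dummy peer has no certificate, so verification is off.
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    sock = socket.create_connection(("127.0.0.1", port), timeout=timeout_seconds)
    try:
        # wrap_socket() runs do_handshake(); the socket timeout applies to it.
        context.wrap_socket(sock)
        return False
    except socket.timeout:
        return True
    finally:
        sock.close()


if __name__ == "__main__":
    port, release = start_silent_server()
    print("handshake timed out:", handshake_times_out(port, 1.0))
    release.set()
```

Without a timeout on the socket, the same wrap_socket() call would wait on the silent peer indefinitely, which matches the do_handshake frame at the top of the traceback.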

The issue appears to be in the handshake function of the ssl module in Python 2.6. It was fixed in Python 2.7, although somebody has reported the same issue on Python 2.7 as well (see the links below). To fix the whole issue, I am going to move to Python 2.7 and also set a timeout of a few minutes on the SQS API calls in my SQS wrapper code. The following links helped me narrow down the root cause and the fix:

http://bugs.python.org/issue5103

http://hg.python.org/cpython/rev/ce4916ca06dd/

Web app hangs for several hours in ssl.py at self._sslobj.do_handshake()

Timeout function if it takes too long to finish
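The timeout on the SQS calls could look roughly like the sketch below, which follows the signal-based approach discussed in the "Timeout function" link. The names call_with_timeout and CallTimedOut are made up for this example, and the code assumes a Unix host and that the call runs in the main thread, since SIGALRM is only delivered there. It is written for Python 3, so treat it as an outline rather than a drop-in for the 2.6 code above:

```python
import signal
import time


class CallTimedOut(Exception):
    """Raised when the wrapped call exceeds its time budget."""


def call_with_timeout(func, timeout_seconds, *args, **kwargs):
    """Run func(*args, **kwargs), raising CallTimedOut after timeout_seconds.

    timeout_seconds must be a positive whole number, because signal.alarm()
    only accepts whole seconds.
    """
    def on_alarm(signum, frame):
        raise CallTimedOut("call exceeded %d seconds" % timeout_seconds)

    previous_handler = signal.signal(signal.SIGALRM, on_alarm)
    signal.alarm(timeout_seconds)
    try:
        return func(*args, **kwargs)
    finally:
        signal.alarm(0)  # cancel the alarm if the call finished in time
        signal.signal(signal.SIGALRM, previous_handler)  # restore old handler


if __name__ == "__main__":
    # time.sleep(5) stands in for a blocking SQS call such as get_queue().
    try:
        call_with_timeout(time.sleep, 1, 5)
    except CallTimedOut as exc:
        print("gave up:", exc)
```

In the wrapper, a call like get_queue() would be passed as func with a budget of a few minutes. When CallTimedOut is raised, the script can exit, so the cron job finds no running process and starts a fresh one.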
