I have a Python script that checks SQS for messages in a loop and then exits. A cron job runs every few minutes and restarts the script if it is not already running.
# start
def main():
    for i from 1 to 100:
        check SQS for a new message   # establishes connections to SQS; long polling not used, Receive Message Wait Time set to 0
        if a new job is found:
            ProcessIt()
# end
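For concreteness, here is a rough Python sketch of that loop in Boto 2 terms. It is not the actual script: the region `us-east-1`, the queue name `my-queue`, and the helpers `poll_queue` and `process_it` are placeholders assumed for illustration. The queue object is passed in so the loop can be exercised without AWS.

```python
def poll_queue(queue, process, iterations=100, wait_seconds=0):
    """Poll `queue` a fixed number of times and process whatever arrives.

    `queue` is expected to behave like a boto.sqs.queue.Queue, i.e. to
    offer get_messages() and delete_message(). wait_seconds=0 means
    short polling, as in the pseudocode above.
    """
    handled = 0
    for _ in range(iterations):
        messages = queue.get_messages(num_messages=1,
                                      wait_time_seconds=wait_seconds)
        for message in messages:
            process(message)
            # Delete only after processing succeeded.
            queue.delete_message(message)
            handled += 1
    return handled


def process_it(message):
    # Placeholder for the real job handler (ProcessIt in the pseudocode).
    print(message.get_body())


def main():
    # Imported here so poll_queue can be tested without Boto installed.
    import boto.sqs

    conn = boto.sqs.connect_to_region("us-east-1")  # assumed region
    queue = conn.get_queue("my-queue")              # assumed queue name
    poll_queue(queue, process_it)
```

The cron job would invoke `main()`. Passing a non-zero `wait_seconds` (up to 20) would turn the same loop into long polling.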
After the script has run for a few days on an EC2 instance, it goes stale and stops checking SQS for new messages.
When I ran lsof against the PID of the process and grepped for the SQS connections, I found that all of them were in CLOSE_WAIT. The only fix so far is to kill the process and restart the script manually. So it seems that cron cannot restart the script, because the process is still running, stuck in the call to SQS:
ip-10-x-y-z:~ # lsof -p 9018 | grep "72.21"
ld-linux. 9018 root 7u IPv4 474699439 0t0 TCP ip-10-x-y-z.ec2.internal:58211->72.21.202.145:https (CLOSE_WAIT)
ld-linux. 9018 root 10u IPv4 474699560 0t0 TCP ip-10-x-y-z.ec2.internal:53428->72.21.194.47:https (CLOSE_WAIT)
ld-linux. 9018 root 12u IPv4 474701017 0t0 TCP ip-10-x-y-z.ec2.internal:52166->72.21.214.70:https (CLOSE_WAIT)
ld-linux. 9018 root 18u IPv4 474694555 0t0 TCP ip-10-x-y-z.ec2.internal:57267->72.21.202.145:https (CLOSE_WAIT)
ld-linux. 9018 root 22u IPv4 474694573 0t0 TCP ip-10-x-y-z.ec2.internal:57271->72.21.202.145:https (CLOSE_WAIT)
ld-linux. 9018 root 39u IPv4 474701031 0t0 TCP ip-10-x-y-z.ec2.internal:52170->72.21.214.70:https (CLOSE_WAIT)
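To summarise output like this rather than reading it by eye, a small sketch along the following lines could tally the CLOSE_WAIT sockets per remote address. The function name `close_wait_by_remote` and the regular expression are illustrative assumptions; the sample rows are copied from the lsof output above.

```python
import re
from collections import Counter

# Rows copied from the lsof output above.
SAMPLE = """\
ld-linux. 9018 root 7u IPv4 474699439 0t0 TCP ip-10-x-y-z.ec2.internal:58211->72.21.202.145:https (CLOSE_WAIT)
ld-linux. 9018 root 10u IPv4 474699560 0t0 TCP ip-10-x-y-z.ec2.internal:53428->72.21.194.47:https (CLOSE_WAIT)
ld-linux. 9018 root 12u IPv4 474701017 0t0 TCP ip-10-x-y-z.ec2.internal:52166->72.21.214.70:https (CLOSE_WAIT)
ld-linux. 9018 root 18u IPv4 474694555 0t0 TCP ip-10-x-y-z.ec2.internal:57267->72.21.202.145:https (CLOSE_WAIT)
ld-linux. 9018 root 22u IPv4 474694573 0t0 TCP ip-10-x-y-z.ec2.internal:57271->72.21.202.145:https (CLOSE_WAIT)
ld-linux. 9018 root 39u IPv4 474701031 0t0 TCP ip-10-x-y-z.ec2.internal:52170->72.21.214.70:https (CLOSE_WAIT)
"""

# Matches "->REMOTE:PORT (CLOSE_WAIT)" and captures REMOTE.
CLOSE_WAIT_RE = re.compile(r"->(\S+):\S+\s+\(CLOSE_WAIT\)")


def close_wait_by_remote(lsof_output):
    """Count CLOSE_WAIT sockets per remote address in lsof text output."""
    counts = Counter()
    for line in lsof_output.splitlines():
        match = CLOSE_WAIT_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


counts = close_wait_by_remote(SAMPLE)
for remote, count in sorted(counts.items()):
    print("%s %d" % (remote, count))
```

On the six rows above this counts three sockets to 72.21.202.145, two to 72.21.214.70 and one to 72.21.194.47. Logging that tally periodically would show whether the CLOSE_WAIT sockets pile up gradually or appear all at once.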
I know I should use long polling, but I am still wondering why the process gets stuck and never recovers on its own. I am using Boto 2.23.
Any input would be helpful.