I have several scripts that I use to do some web crawling. They are always running, and should never stop. However, after about a week, they systematically "freeze": there is no output anymore, no response to Ctrl+C or anything. The only way is to kill the process and restart it.

I suspect that these issues come from the library I use for retrieving the data (urllib2), but the issue is very hard to reproduce.

I am thus wondering how I could check the state of the process and kill/restart it automatically if it is frozen. I was thinking of creating a PID file and updating it regularly. Another script could then periodically check the last modification date of this PID file and restart the process if it is too old. I could use something like Monit to do the monitoring.

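To illustrate, here is a minimal sketch of what I have in mind. The file path, timeout, and crawler.py entry point are just placeholders, and Monit could replace the watchdog loop entirely:

    import os
    import subprocess
    import time

    HEARTBEAT = "/tmp/crawler.heartbeat"  # hypothetical path shared by both sides
    MAX_AGE = 600                         # seconds of silence before assuming a freeze

    # Crawler side: call this at the top of every crawl iteration.
    def beat():
        with open(HEARTBEAT, "w") as f:
            f.write(str(os.getpid()))

    # Watchdog side: run as a separate process (or let Monit do this check).
    def watchdog():
        while True:
            try:
                stale = time.time() - os.path.getmtime(HEARTBEAT) > MAX_AGE
            except OSError:
                stale = True  # heartbeat file missing: treat as frozen
            if stale:
                try:
                    # SIGKILL, since a frozen process may ignore SIGTERM.
                    os.kill(int(open(HEARTBEAT).read()), 9)
                except (IOError, ValueError, OSError):
                    pass
                subprocess.Popen(["python", "crawler.py"])  # hypothetical entry point
                open(HEARTBEAT, "w").close()  # reset the clock for the fresh process
            time.sleep(60)
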
Is this how I should do it? Is there another best practice/common way for checking the responsiveness of a process?

Wookai
  • In the spirit of just doing the simplest thing, couldn't you just have a shell script that calls your Python script forever, with the Python script finishing after 'n' crawls? It might not solve the underlying problem, but it might let you spend more effort on analysing your crawled data. – sotapme Feb 26 '13 at 10:13
  • I agree, that sort of thing would work, and this is what I will do if I don't find a good solution soon. But I feel I could do better ;). – Wookai Feb 26 '13 at 10:14
  • I'm not quite sure, but I think these processes are in state "D" (see man ps). Couldn't you make a cron job that checks whether the given process is in state D? – Antoine Pinsard Feb 26 '13 at 10:16
  • Good point. I'll check the state next time this happens. If it is the case, then your solution would work. – Wookai Feb 26 '13 at 10:17
  • I wonder what state your crawl is in when the process freezes: has it processed 525 of 1200 links, so that on restart you have to purge those 525 links and restart the crawl for that site? I suppose you'll also want to do something like http://stackoverflow.com/a/133384/1481060 so that it gives you a clue as to where it's stuck (see the sketch after these comments). – sotapme Feb 26 '13 at 10:40
  • Thanks for the pointer, that looks nice. For my crawls, I restart from scratch on failure, so it's not a problem. – Wookai Feb 26 '13 at 10:49

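For reference, a minimal sketch of the stack-dump trick from the answer linked in the comments: install a handler so that kill -USR1 <pid> prints where the process currently is. It only helps if the interpreter can still execute Python code; a process wedged inside a C-level call will not respond:

    import signal
    import traceback

    def dump_stack(signum, frame):
        # Print the stack of the interrupted frame (to stderr by default).
        traceback.print_stack(frame)

    signal.signal(signal.SIGUSR1, dump_stack)
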
1 Answer


If you have a process that is always running, has no controlling terminal, and is the process group leader, that is a daemon. You undoubtedly know all that.

There are some de facto practices for coding programs like that. One is to have a signal handler that catches SIGHUP and forces the program to reinitialize itself. This means closing all of the open log files, rereading config files, etc. I do not know how applicable that is to your problem, but at my work it sometimes solves issues like frozen daemons.

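Here is a minimal sketch of that convention; reinitialize() and do_one_unit_of_work() are hypothetical stand-ins for your own code, and the handler only sets a flag because signal handlers should do as little as possible:

    import signal

    reload_requested = False

    def on_hup(signum, frame):
        # Just record the request; the main loop does the real work.
        global reload_requested
        reload_requested = True

    signal.signal(signal.SIGHUP, on_hup)

    while True:
        if reload_requested:
            reload_requested = False
            reinitialize()         # hypothetical: close log files, reread config
        do_one_unit_of_work()      # hypothetical: one crawl step
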
You can customize the idea by employing the SIGUSR1 and SIGUSR2 signals to do special things, like writing status to a file, or anything else. Since signals come in on an interrupt, the trap statement in shell scripts and signal handlers in Python itself will push program state onto the interrupt stack and do "stuff". In your case you may want the program to fork/exec itself and then kill the parent (sketched below).

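A sketch of both ideas, with hypothetical file names. Note that os.execv replaces the running image in place, which gets you the effect of fork/exec followed by killing the parent, without the extra process:

    import os
    import signal
    import sys
    import time

    def write_status(signum, frame):
        # SIGUSR1: dump a status line so an outside observer can poll us.
        with open("/tmp/crawler.status", "w") as f:  # hypothetical path
            f.write("alive at %s\n" % time.ctime())

    def restart_self(signum, frame):
        # SIGUSR2: replace this process with a fresh copy of itself.
        os.execv(sys.executable, [sys.executable] + sys.argv)

    signal.signal(signal.SIGUSR1, write_status)
    signal.signal(signal.SIGUSR2, restart_self)
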
jim mcnamara