
My program is a web crawler, and it's been stuck on a URL that apparently corresponds to a random Chinese site. For some reason it's not throwing an exception and the connection is not timing out. I would have thought that these lines would prevent that:

static URLConnection in;
in = curURL.openConnection();
in.setConnectTimeout(2000);
pageSource = new StreamedSource(in);

I'm nearly positive this is the issue; checks on the heap dump for memory leaks turned up nothing.

Chenab

1 Answer


setConnectTimeout() only controls the timeout for establishing the connection. Once the connection has been established, it can stay open for a long time (basically until the server closes it); for instance, you might be downloading a very large file over a slow link.
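
As a rough illustration of that distinction (the URL and buffer size below are made up, not from the question), only the connection phase is bounded; once the connection is established, each read can block for as long as the server keeps it open:

import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;

public class ConnectTimeoutOnly {
    public static void main(String[] args) throws Exception {
        URLConnection conn = new URL("http://example.com/").openConnection();

        // Bounded phase: throws SocketTimeoutException if the connection
        // cannot be established within 2 seconds.
        conn.setConnectTimeout(2000);

        // Unbounded phase: with no read timeout, each read() may block
        // indefinitely while the server keeps the connection open.
        try (InputStream in = conn.getInputStream()) {
            byte[] buf = new byte[8192];
            while (in.read(buf) != -1) {
                // process the downloaded bytes
            }
        }
    }
}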

One solution would be to add a watchdog thread monitoring the connections and closing those which exceed some time limit.
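
A rough sketch of that approach, assuming the crawler uses HttpURLConnection so that disconnect() is available (the class and method names here are illustrative, not from the original code):

import java.net.HttpURLConnection;
import java.net.URL;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class ConnectionWatchdog {
    private static final ScheduledExecutorService watchdog =
            Executors.newSingleThreadScheduledExecutor();

    // Opens a connection and schedules it to be forcibly closed after
    // maxSeconds; a thread blocked reading from it then gets an exception
    // instead of hanging forever.
    static HttpURLConnection openWithDeadline(URL url, long maxSeconds) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setConnectTimeout(2000);
        watchdog.schedule(conn::disconnect, maxSeconds, TimeUnit.SECONDS);
        return conn;
    }
}

In a fuller version you would also cancel the scheduled task once a page has been read successfully, so the watchdog does not fire on connections that finished normally.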

Guillaume
  • That sounds like exactly what I need; how should I go about doing that? – Chenab Jul 08 '13 at 16:59
  • Are you using a 3rd-party crawler library? If not, you might want to look into existing solutions: http://stackoverflow.com/questions/12026978/open-source-java-web-crawler – Guillaume Jul 08 '13 at 17:09
  • I'm using the Jericho HTML parser; it's nice for extracting information from a page, but I don't see anything that would help me here. Is there a way to put a time limit on the execution of a step in a loop? – Chenab Jul 08 '13 at 17:19
  • The job of the crawler is (among other things) to queue, monitor and execute the requests, then it may pass the data to a parser. If you want to keep your code have a look at the accepted answer for this question: http://stackoverflow.com/questions/2758612/executorservice-that-interrupts-tasks-after-a-timeout – Guillaume Jul 08 '13 at 18:07
  • So it turns out URLConnection has a setReadTimeout method that I failed to notice; I think my problems are solved (see the sketch after this thread). – Chenab Jul 08 '13 at 18:44
  • I missed that one! Much better indeed! – Guillaume Jul 08 '13 at 19:42
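
For completeness, a minimal sketch of the fix described in the last comments, with both timeouts set before the page is read (the 5-second read timeout is an arbitrary example value):

import java.io.IOException;
import java.net.URL;
import java.net.URLConnection;

class TimeoutSetup {
    static URLConnection open(URL curURL) throws IOException {
        URLConnection conn = curURL.openConnection();
        conn.setConnectTimeout(2000); // limit on establishing the connection
        conn.setReadTimeout(5000);    // limit on each blocking read; a stalled
                                      // transfer now throws SocketTimeoutException
        return conn;
    }
}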