
My program is a web crawler, and it's been stuck on a URL that apparently corresponds to a random Chinese site. For some reason it's not throwing an exception and the connection is not timing out. I would have thought that these lines would prevent that:

static URLConnection in;
in = curURL.openConnection();
in.setConnectTimeout(2000);
pageSource = new StreamedSource(in);

I'm nearly positive this is the issue; checks on the heap dump for memory leaks turned up nothing.

Chenab

1 Answer


setConnectTimeout() only controls the timeout for establishing the connection. Once the connection has been established, it can stay open for a long time (basically until the server closes it); for instance, you might be downloading a very large file over a slow link.
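
As a rough illustration of that distinction (the URL and buffer size below are made up, not from the question), only the connection phase is bounded; once the connection is established, each read can block for as long as the server keeps it open:

import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;

public class ConnectTimeoutOnly {
    public static void main(String[] args) throws Exception {
        URLConnection conn = new URL("http://example.com/").openConnection();

        // Bounded phase: throws SocketTimeoutException if the connection
        // cannot be established within 2 seconds.
        conn.setConnectTimeout(2000);

        // Unbounded phase: with no read timeout, each read() may block
        // indefinitely while the server keeps the connection open.
        try (InputStream in = conn.getInputStream()) {
            byte[] buf = new byte[8192];
            while (in.read(buf) != -1) {
                // process the downloaded bytes
            }
        }
    }
}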

One solution would be to add a watchdog thread monitoring the connections and closing those which exceed some time limit.
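
A rough sketch of that approach, assuming the crawler uses HttpURLConnection so that disconnect() is available (the class and method names here are illustrative, not from the original code):

import java.net.HttpURLConnection;
import java.net.URL;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class ConnectionWatchdog {
    private static final ScheduledExecutorService watchdog =
            Executors.newSingleThreadScheduledExecutor();

    // Opens a connection and schedules it to be forcibly closed after
    // maxSeconds; a thread blocked reading from it then gets an exception
    // instead of hanging forever.
    static HttpURLConnection openWithDeadline(URL url, long maxSeconds) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setConnectTimeout(2000);
        watchdog.schedule(conn::disconnect, maxSeconds, TimeUnit.SECONDS);
        return conn;
    }
}

In a fuller version you would also cancel the scheduled task once a page has been read successfully, so the watchdog does not fire on connections that finished normally.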

Guillaume
  • That sounds like exactly what I need; how should I go about doing that? – Chenab Jul 08 '13 at 16:59
  • Are you using a 3rd-party crawler library? If not, you might want to look into existing solutions: http://stackoverflow.com/questions/12026978/open-source-java-web-crawler – Guillaume Jul 08 '13 at 17:09
  • I'm using the Jericho HTML parser; it's nice for extracting information from a page, but I don't see anything that would help me here. Is there a way to put a time limit on the execution of a step in a loop? – Chenab Jul 08 '13 at 17:19
  • The job of the crawler is (among other things) to queue, monitor and execute the requests, then it may pass the data to a parser. If you want to keep your code have a look at the accepted answer for this question: http://stackoverflow.com/questions/2758612/executorservice-that-interrupts-tasks-after-a-timeout – Guillaume Jul 08 '13 at 18:07
  • So it turns out URLConnection has a setReadTimeout method that I failed to notice; I think my problems are solved (see the sketch after this thread). – Chenab Jul 08 '13 at 18:44
  • I missed that one! Much better indeed! – Guillaume Jul 08 '13 at 19:42
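
For completeness, a minimal sketch of the fix described in the last comments, with both timeouts set before the page is read (the 5-second read timeout is an arbitrary example value):

import java.io.IOException;
import java.net.URL;
import java.net.URLConnection;

class TimeoutSetup {
    static URLConnection open(URL curURL) throws IOException {
        URLConnection conn = curURL.openConnection();
        conn.setConnectTimeout(2000); // limit on establishing the connection
        conn.setReadTimeout(5000);    // limit on each blocking read; a stalled
                                      // transfer now throws SocketTimeoutException
        return conn;
    }
}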