I get a SocketTimeoutException when I try to parse an HTML page using Jsoup:

java.net.SocketTimeoutException: Read timed out
  at java.net.SocketInputStream.socketRead0(Native Method)
  at java.net.SocketInputStream.read(Unknown Source)
  at java.io.BufferedInputStream.fill(Unknown Source)
  at java.io.BufferedInputStream.read1(Unknown Source)
  at java.io.BufferedInputStream.read(Unknown Source)
  at sun.net.www.http.HttpClient.parseHTTPHeader(Unknown Source)
  at sun.net.www.http.HttpClient.parseHTTP(Unknown Source)
  at sun.net.www.protocol.http.HttpURLConnection.getInputStream(Unknown Source)
  at java.net.HttpURLConnection.getResponseCode(Unknown Source)
  at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:381)
  at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:364)
  at org.jsoup.helper.HttpConnection.execute(HttpConnection.java:143)
  at org.jsoup.helper.HttpConnection.get(HttpConnection.java:132)
  at app.ForumCrawler.crawl(ForumCrawler.java:50)
  at Main.main(Main.java:15)

I use the following code to parse the page, because I also need the response status code (200, 404, etc.):

String userAgent = "Mozilla/5.0 (jsoup)";
int timeout = 5 * 1000;
Document localDoc = null;
String url = "<url>";
Connection.Response response = Jsoup.connect(url).userAgent(userAgent).timeout(timeout).execute();
if(response.statusCode() == 200) {
    localDoc = Jsoup.parse(response.body());
    //do the stuff..
}

I have come across that if we use .get() instead of .execute() we can get rid of SocketTimeoutException issue, but if I use .get() then I can't get the response.

Please suggest which one to use so that I avoid the SocketTimeoutException and can still read the response status when parsing the page.

Thanks in advance.

Vitalii Elenhaupt
Vinodh Velumayil
  • Try `timeout(10*1000).get();`, does it work? Otherwise you could use the standard Java classes to download the HTML from the URL as a string, and then parse it with Jsoup (see http://stackoverflow.com/questions/4328711/read-url-to-string-in-few-lines-of-java-code). What is your URL? – Jonas Czech May 29 '15 at 06:49
  • Yes @JonasCz, I have tried that, but how can I get the response status if I use get()? Please help me with that if you know how. – Vinodh Velumayil May 29 '15 at 06:52
  • Try `localDoc = response.parse();` instead of `localDoc = Jsoup.parse(response.body());`. I missed that you need the response code, sorry. With `.get()` you cannot get the response code. Otherwise, try the standard classes to download the page as a String, and then parse it. – Jonas Czech May 29 '15 at 06:58

1 Answer

I have come across that if we use .get() instead of .execute() we can get rid of SocketTimeoutException issue

As the stack trace in your post shows, the get() method internally calls the execute() method:

at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:381)
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:364)
at org.jsoup.helper.HttpConnection.execute(HttpConnection.java:143)<-- execute() called
at org.jsoup.helper.HttpConnection.get(HttpConnection.java:132)<-- ...from get().
at app.ForumCrawler.crawl(ForumCrawler.java:50)

So switching to get() will not get rid of the SocketTimeoutException: both methods make the same HTTP request. However, you can strengthen your code by handling this exception.
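
One way to handle the exception is to retry the request a few times before giving up. The following is only a minimal sketch, not part of Jsoup: the class `TimeoutRetry` and the method `retryOnTimeout` are hypothetical names introduced for illustration, and the retry count is an assumption you would tune for your crawler. It uses only standard Java classes, so the Jsoup call is passed in as a `Callable`.

```java
import java.net.SocketTimeoutException;
import java.util.concurrent.Callable;

/**
 * Hypothetical helper (not part of Jsoup): runs a request and retries it
 * when it fails with a SocketTimeoutException.
 *
 * Possible use with the code from the question:
 *   Connection.Response response = TimeoutRetry.retryOnTimeout(
 *       () -> Jsoup.connect(url).userAgent(userAgent).timeout(timeout).execute(), 3);
 */
final class TimeoutRetry {
    private TimeoutRetry() {}

    static <T> T retryOnTimeout(Callable<T> request, int maxAttempts) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        SocketTimeoutException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                // Success: return the result of the request immediately.
                return request.call();
            } catch (SocketTimeoutException e) {
                // Timed out: remember the failure and try again.
                last = e;
            }
        }
        // Every attempt timed out: rethrow the last timeout to the caller.
        throw last;
    }
}
```

Because the helper returns whatever `execute()` returns, you keep access to `response.statusCode()` and can still call `response.parse()` to get the document. Any exception other than a timeout is passed through unchanged on the first failure.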

Stephan