104

I get a SocketTimeoutException when I try to parse a lot of HTML documents using Jsoup.

For example, I have a list of links:

<a href="www.domain.com/url1.html">link1</a>
<a href="www.domain.com/url2.html">link2</a>
<a href="www.domain.com/url3.html">link3</a>
<a href="www.domain.com/url4.html">link4</a>

For each link, I parse the document linked to the URL (from the href attribute) to get other pieces of information in those pages.

I can imagine that this takes a lot of time, but how can I get rid of this exception? Here is the whole stack trace:

java.net.SocketTimeoutException: Read timed out
    at java.net.SocketInputStream.socketRead0(Native Method)
    at java.net.SocketInputStream.read(Unknown Source)
    at java.io.BufferedInputStream.fill(Unknown Source)
    at java.io.BufferedInputStream.read1(Unknown Source)
    at java.io.BufferedInputStream.read(Unknown Source)
    at sun.net.www.http.HttpClient.parseHTTPHeader(Unknown Source)
    at sun.net.www.http.HttpClient.parseHTTP(Unknown Source)
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream(Unknown Source)
    at java.net.HttpURLConnection.getResponseCode(Unknown Source)
    at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:381)
    at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:364)
    at org.jsoup.helper.HttpConnection.execute(HttpConnection.java:143)
    at org.jsoup.helper.HttpConnection.get(HttpConnection.java:132)
    at app.ForumCrawler.crawl(ForumCrawler.java:50)
    at Main.main(Main.java:15)
double-beep
C. Maillard
  • 3
    The code you added in your edit sets the timeout to infinite. This is undesirable in most use cases. It is much better to use a specific timeout as indicated in MarcoS's answer, even if the timeout is long. – stepanian Dec 20 '14 at 22:44
  • 2
    I guess `timeout(0)` will make Jsoup try to connect to the URL again and again until it connects. – Evan Hu Jul 12 '15 at 02:45
  • This seems to be a solution found by Question author [C. Maillard](https://stackoverflow.com/users/817143) `Jsoup.connect(url).timeout(0).get();` as per [earlier revision](https://stackoverflow.com/revisions/6571548/6) – Scratte Sep 13 '20 at 11:50

6 Answers

147

I think you can do

Jsoup.connect("...").timeout(10 * 1000).get(); 

which sets the timeout to 10 seconds.
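Even with a longer timeout, a crawl over many pages can still hit the occasional slow response. The following is only a minimal sketch of one way to handle that: retry the fetch a few times when it fails with `SocketTimeoutException`. The class `TimeoutRetry` and the method `fetchWithRetry` are hypothetical names invented for this illustration and are not part of Jsoup; the fetch is passed in as a `Callable` so the helper itself depends only on the JDK.

```java
import java.net.SocketTimeoutException;
import java.util.concurrent.Callable;

/** Hypothetical helper (not part of Jsoup): retries a fetch that timed out. */
final class TimeoutRetry {

    private TimeoutRetry() {}

    /**
     * Calls fetch up to maxAttempts times. Only SocketTimeoutException
     * triggers a retry; any other exception propagates immediately.
     */
    static <T> T fetchWithRetry(Callable<T> fetch, int maxAttempts) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        SocketTimeoutException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return fetch.call();
            } catch (SocketTimeoutException e) {
                last = e; // remember the failure and try again
            }
        }
        throw last; // every attempt timed out
    }
}
```

Assuming Jsoup is on the classpath, it could be used as `Document doc = TimeoutRetry.fetchWithRetry(() -> Jsoup.connect(url).timeout(10 * 1000).get(), 3);`.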

frogatto
MarcoS
  • 5
    121 upvotes but no explanation of why this fixes the issue? Why does that address the problem when the default is, it would appear, 30 seconds? – Alan Hay Nov 02 '17 at 09:17
  • 2
    @AlanHay my answer was suggesting to solve the problem by setting a timeout, not by using that specific value as a timeout :) – MarcoS Nov 03 '17 at 15:32
  • 2
    @AlanHay, the default timeout when the Q & A were written was 3 seconds. So any increase would have lowered the socket timeout frequency and helped to fix the issue. I updated the default to 30 seconds in 2016. – Jonathan Hedley Jan 09 '21 at 21:27
27

Ok - so, I tried to offer this as an edit to MarcoS's answer, but the edit was rejected. Nevertheless, the following information may be useful to future visitors:

According to the javadocs, the default timeout for an org.jsoup.Connection is 30 seconds.

As has already been mentioned, this can be set using timeout(int millis).

As the OP notes in the edit, it can also be set to timeout(0). However, as the javadocs state:

A timeout of zero is treated as an infinite timeout.
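Because zero means "wait forever", one option is to guard against passing it by accident. This is a minimal sketch under that assumption; `Timeouts` and `finiteTimeoutMillis` are hypothetical names made up for this example, not Jsoup API. The helper only converts a `java.time.Duration` into the `int` milliseconds that `timeout(int millis)` expects and rejects values that would be zero or negative.

```java
import java.time.Duration;

/** Hypothetical helper (not part of Jsoup): builds a finite timeout value. */
final class Timeouts {

    private Timeouts() {}

    /**
     * Converts a Duration to whole milliseconds as an int.
     * Rejects zero or negative values, since a timeout of 0 is treated as infinite.
     */
    static int finiteTimeoutMillis(Duration timeout) {
        long millis = timeout.toMillis();
        if (millis <= 0) {
            throw new IllegalArgumentException("timeout must be positive; 0 means infinite");
        }
        if (millis > Integer.MAX_VALUE) {
            throw new IllegalArgumentException("timeout does not fit in an int of milliseconds");
        }
        return (int) millis;
    }
}
```

With Jsoup on the classpath, this might be used as `Jsoup.connect(url).timeout(Timeouts.finiteTimeoutMillis(Duration.ofSeconds(60))).get();`.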

amaidment
  • 3
    Setting an infinite timeout is a bad idea in most cases. Use a long timeout, but always specify one. See MarcoS's answer. – stepanian Dec 20 '14 at 22:46
  • 3
    @stepanian - to be clear, I'm not advocating setting an infinite timeout. This had been suggested as the solution by the OP, although I wanted to direct future users to the implications of this. Indeed, when I originally posted my 'answer', I indicated that I thought it should have been an edit to MarcoS's answer, as there was some additional information that might be useful to future users... but the edit was rejected. – amaidment Jan 06 '15 at 18:51
  • The default timeout is not 3 seconds, but 30 seconds (30000 millis), you can see it in https://jsoup.org/apidocs/org/jsoup/Connection.html – aldok Mar 04 '17 at 13:34
  • The timeout used to be 3 seconds, back when the question was written. – Jonathan Hedley Jan 09 '21 at 21:28
4

I had the same error:

java.net.SocketTimeoutException: Read timed out
    at java.net.SocketInputStream.socketRead0(Native Method)
    at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
    at java.net.SocketInputStream.read(SocketInputStream.java:171)
    at java.net.SocketInputStream.read(SocketInputStream.java:141)

and only setting .userAgent("Opera") worked for me.

So I used the userAgent(String userAgent) method of the Connection class to set the Jsoup user agent.

Something like:

Jsoup.connect("link").userAgent("Opera").get();
invzbl3
3

There is a mistake at https://jsoup.org/apidocs/org/jsoup/Connection.html. The default timeout is not 30 seconds; it is 3 seconds. Just look at the javadoc in the source code: it says 3000 ms.

Bartek
  • 45
  • 1
  • 1
    On java doc: "The default timeout is 30 seconds (30,000 millis). A timeout of zero is treated as an infinite timeout." https://jsoup.org/apidocs/org/jsoup/Connection.html – jeton May 22 '18 at 16:22
-1

This should work: Jsoup.connect(url.toLowerCase()).timeout(0).get();

Masoud Rahimi
-6

Set a timeout when connecting with Jsoup.

Gaurab Pradhan