I have a set of pages like this:

www.foo1.bar
www.foo2.bar
www.foo3.bar
.
.
www.foo100.bar

I am using the jsoup library and connecting to all pages at the same time, each in its own Thread:

Thread matchThread = new Thread(task);
matchThread.start();

Each task connects to its page like this and parses the HTML:

Jsoup.connect("www.fooX.bar").timeout(0).get();

I am getting tons of these exceptions:

java.net.ConnectException: Connection timed out: connect
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:351)
at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:213)
at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:200)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366)
at java.net.Socket.connect(Socket.java:529)
at sun.net.NetworkClient.doConnect(NetworkClient.java:158)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:388)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:523)
at sun.net.www.http.HttpClient.<init>(HttpClient.java:227)
at sun.net.www.http.HttpClient.New(HttpClient.java:300)
at sun.net.www.http.HttpClient.New(HttpClient.java:317)
at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:970)
at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:911)
at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:836)
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:404)
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:391)
at org.jsoup.helper.HttpConnection.execute(HttpConnection.java:157)
at org.jsoup.helper.HttpConnection.get(HttpConnection.java:146)

Does jsoup allow only one thread at a time? Or what am I doing wrong? Any suggestions on how to connect to my pages faster? Going one by one takes ages.

EDIT:

All 700 threads use this method; maybe this is the problem. Can this method handle this number of threads, or does it behave like a singleton?

private static Document connect(String url) {
    Document doc = null;
    try {
        doc = Jsoup.connect(url).timeout(0).get();
    } catch (IOException e) {
        System.out.println(url);
    }
    return doc; 
}

EDIT: whole thread code

public class MatchWorker implements Callable<Match> {

    private Element element;

    public MatchWorker(Element element) {
        this.element = element;
    }

    @Override
    public Match call() throws Exception {
        Match match = null;
        Util.connectAndDoStuff();
        return match;
    }

}

All 700 elements:

    Collection<Match> matches = new ArrayList<Match>();
    Collection<Future<Match>> results = new ArrayList<Future<Match>>();

    for (Element element : elements) {
        MatchWorker matchWorker = new MatchWorker(element);
        FutureTask<Match> task = new FutureTask<Match>(matchWorker);
        results.add(task);

        Thread matchThread = new Thread(task);
        matchThread.start();
    }
    for (Future<Match> match : results) {
        try {
            matches.add(match.get());
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
Jaanus
  • Have you double-checked that www.fooX.bar is online, e.g. that a browser can access it or that you can telnet to it on port 80? – linski Oct 06 '12 at 12:17
  • Yes, I can access it; it is basically just a website like any other. – Jaanus Oct 06 '12 at 12:34
  • @Jaanus: I also face this issue most of the time. Were you able to find a solution for this? – Richa Aug 02 '18 at 01:40

2 Answers

7

I tried this:

    ExecutorService executorService = Executors.newFixedThreadPool(5);
    List<Future<Void>> handles = new ArrayList<Future<Void>>();
    Future<Void> handle;
    for (int i=0;i < 12; i++) {
        handle = executorService.submit(new Callable<Void>() {

            public Void call() throws Exception {
                Document d = Jsoup.connect("http://www.google.hr").timeout(0).get();
                System.out.println(d.title());
                return null;
            }
        });
        handles.add(handle);
    }

    for (Future<Void> h : handles) {
        try {
            h.get();
        } 
        catch (Exception ex) {
            ex.printStackTrace();
        }
    }

    executorService.shutdownNow();

It finishes almost immediately and prints the correct title. Perhaps you have a firewall issue? ("Connection timed out" means that the server could not be reached at all.)

EDIT:

I used jsoup 1.7.1.

EDIT^2:

AFAIK, this should prove that there is no problem between jsoup and threads, because the executor ultimately runs its tasks on threads.

EDIT^3:

Also, if you are behind a proxy, here is how you can set proxy settings.
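As a minimal sketch of one common way to do this on the JVM: the standard `http.proxyHost` / `http.proxyPort` (and `https.*`) system properties are read by `HttpURLConnection`, which jsoup uses underneath according to the other answer on this page. The class name, helper name, host and port below are hypothetical placeholders, and this is not necessarily what the link above described.

```java
public class ProxySettings {

    // Hypothetical helper: sets the JVM-wide proxy system properties that
    // java.net.HttpURLConnection consults when opening connections.
    static void useHttpProxy(String host, int port) {
        System.setProperty("http.proxyHost", host);
        System.setProperty("http.proxyPort", String.valueOf(port));
        System.setProperty("https.proxyHost", host);
        System.setProperty("https.proxyPort", String.valueOf(port));
    }

    public static void main(String[] args) {
        // Placeholder values; replace them with your real proxy.
        useHttpProxy("proxy.example.com", 8080);
        System.out.println(System.getProperty("http.proxyHost")
                + ":" + System.getProperty("http.proxyPort"));
    }
}
```

The properties need to be set before the first connection is opened, and they affect every HTTP connection made by the JVM, not just jsoup's.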

EDIT^4:

public static Document connect(String url) {
    Document doc = null;
    try {
        doc = Jsoup.connect(url).timeout(0).get();
    } catch (IOException ex) {
        ex.printStackTrace();
    }
    return doc;
}

and the call function rewritten:

public Void call() throws Exception {                    
    System.out.println(App.connect("http://www.google.hr").title());
    return null;
}

This gives the same result. The only thing I can think of is some implicit static synchronization, but that doesn't make much sense, since what you get is a timeout exception :/ Please post your thread code.

EDIT:

I have to go away for a couple of hours. Here are all three of my classes, rewritten.

They still work; slower, but they work. I would definitely recommend using a fixed thread pool to improve performance.

But I think it must be a network issue. Good luck :)

EDIT:

Connection timed out means that the destination server could not be reached at all. AFAIK, this means that either the server never sent or the client never received the TCP SYN+ACK message.

The first thing one could conclude is that the destination server is not online, but there are other possible causes. One is that the destination server is overloaded with requests (in the extreme case, a (D)DoS attack).

So far, you have tried the parallel approach, with requests running concurrently in separate threads:

1) Making 700 requests in 700 threads (well, not actually 700, but as many as your OS can take)

2) Making 700 requests through a thread pool of n << 700 threads

First, you could try putting a random sleep of 0 to 10 s in front of each request:

Thread.sleep(new Random().nextInt(10000));

But given the results so far, that probably won't work. The next step is to replace the parallel approach with the sequential one you mentioned in a comment: every request runs one after another inside the for loop of a single main thread. You could also add the random sleep there.

That's the gentlest (slowest) way you can go, and if that doesn't work, I don't know how to solve it :(
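The sequential variant with a random pause could look roughly like the sketch below. The class name `SequentialFetcher`, the `fetchAll` helper and the pluggable `fetcher` function are assumptions made for illustration; in real use the fetcher would wrap something like `Jsoup.connect(url).get()` and convert the checked `IOException` into an unchecked one.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;
import java.util.function.Function;

public class SequentialFetcher {

    // Fetches the URLs one after another on the calling thread, sleeping a
    // random time in [0, maxDelayMillis) before each request.
    static <T> List<T> fetchAll(List<String> urls,
                                Function<String, T> fetcher,
                                int maxDelayMillis) {
        Random random = new Random();
        List<T> results = new ArrayList<>();
        for (String url : urls) {
            if (maxDelayMillis > 0) {
                try {
                    Thread.sleep(random.nextInt(maxDelayMillis));
                } catch (InterruptedException e) {
                    // Restore the interrupt flag and give up.
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("Interrupted while waiting", e);
                }
            }
            results.add(fetcher.apply(url));
        }
        return results;
    }

    public static void main(String[] args) {
        // Stub fetcher standing in for a real HTTP call (assumption).
        List<String> pages = fetchAll(
                Arrays.asList("http://www.foo1.bar", "http://www.foo2.bar"),
                url -> "fetched " + url,
                100);
        pages.forEach(System.out::println);
    }
}
```

With a 10 s upper bound and 700 pages this takes roughly an hour on average, which is the price of being this gentle.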

EDIT:

Using a thread pool of 5 threads, I downloaded the titles of 1141 soccer matches successfully.

It is logical for a site of that kind to protect its data. Most likely, while you were developing and testing (repeatedly running with as many threads as you could summon), their system identified you (your IP) as a crawler that wants all their data. They obviously neither like nor want that, so they banned you. They decided not even to refuse your requests but to play dead, hence the connection timeout. That makes sense now. Phew :)

If this is correct, you should be able to get the data through a proxy, but be polite and use a thread pool of fewer than 10 threads :)
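A rough sketch of that polite setup follows: a small fixed thread pool combined with the retry loop mentioned in the comments. `PoliteCrawler`, `fetchWithRetry`, `fetchAll` and the pluggable `fetcher` are hypothetical names, not jsoup API; a real fetcher would wrap `Jsoup.connect(url).get()` and rethrow its `IOException` as an unchecked exception.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

public class PoliteCrawler {

    // Calls the fetcher up to maxAttempts times and returns the first
    // successful result; rethrows the last failure if every attempt fails.
    static <T> T fetchWithRetry(String url, Function<String, T> fetcher, int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return fetcher.apply(url);
            } catch (RuntimeException e) {
                last = e;
            }
        }
        throw last;
    }

    // Runs all fetches through a fixed pool of poolSize threads. Results are
    // returned in input order; a URL that failed every attempt yields null.
    static <T> List<T> fetchAll(List<String> urls, Function<String, T> fetcher,
                                int poolSize, int maxAttempts) {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        try {
            List<Future<T>> futures = new ArrayList<>();
            for (String url : urls) {
                Callable<T> task = () -> fetchWithRetry(url, fetcher, maxAttempts);
                futures.add(pool.submit(task));
            }
            List<T> results = new ArrayList<>();
            for (Future<T> future : futures) {
                try {
                    results.add(future.get());
                } catch (ExecutionException e) {
                    results.add(null);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("Interrupted while waiting", e);
                }
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Stub fetcher standing in for a real HTTP call (assumption).
        List<String> titles = fetchAll(
                Arrays.asList("http://www.foo1.bar", "http://www.foo2.bar"),
                url -> "title of " + url, 5, 3);
        titles.forEach(System.out::println);
    }
}
```

The pool size caps how many connections hit the server at once, while the retry count absorbs the occasional timeout without falling back to hundreds of threads.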

linski
  • No, I have the latest jsoup and am not behind a proxy. I have exactly 765 sites to go through, and each one is different. My threads use the same static method for connecting; maybe that causes problems. See the updates in the question. – Jaanus Oct 06 '12 at 13:36
  • Can you post the entire working thread code? I tried it with the static method; see the edit soon. – linski Oct 06 '12 at 13:43
  • I put all the thread code in the first post... – Jaanus Oct 06 '12 at 13:50
  • Okay, it is better when slowing the threads down and not creating 700 threads, but I am still getting `java.net.ConnectException: Connection timed out: connect` on some occasions. – Jaanus Oct 06 '12 at 16:59
  • I'd try sniffing the traffic with something like Wireshark :/ – linski Oct 06 '12 at 17:54
  • linski, when I put the `connect` method in a loop that retries up to 10 times until there is no exception, it finally connects, but sometimes it has to retry about 5 times. – Jaanus Oct 07 '12 at 11:29
  • See update. Is it appropriate to ask you for these URLs so that I could try from my part of the network? If I succeed, the problem is not in the destination server but somewhere on your route to the server. If I fail, then it is *probably* the destination server (firewall, IDS?). – linski Oct 07 '12 at 12:28
  • http://www.betexplorer.com/results/ : there are hundreds of games, and I go through all the links. They work nicely, but when I try to connect with 150 threads simultaneously, the problem appears. – Jaanus Oct 07 '12 at 13:09
  • As in your post, maybe when I try to connect with hundreds of threads, the site thinks I'm DDoSing it. – Jaanus Oct 07 '12 at 13:12
  • See update. Please consider upvoting and accepting if that is possible, since it is closed. Thank you. – linski Oct 07 '12 at 14:02
0

Jsoup just calls HttpURLConnection to connect. It has no influence on whether that connection fails. The things that could influence it are firewalls and DDoS-prevention devices. It's not surprising that a website rejects a flood of simultaneous connections.

bmargulies
  • But shouldn't he get the first few OK, and then connection refused rather than connection timed out? – linski Oct 06 '12 at 14:11
  • Well, I could imagine a very clever Google not ACKing until it sees whether the SYN is part of a flock, and refusing them all. – bmargulies Oct 06 '12 at 16:58
  • Yes, I realize that, but tell me whether this is correct: when the first SYN gets ***refused***, wouldn't the exception thrown be Connection ***Refused***? And if the SYN+ACK is never received, the exception thrown should be Connection ***Timeout***? – linski Oct 06 '12 at 17:52
  • Mostly, I've read the source of jsoup, and it has no possibly telepathic influence that could produce those exceptions. – bmargulies Oct 06 '12 at 17:53