
I have a crawler that reads the HTML source of many web pages. Sometimes I get the exception: Connection timed out: connect.

I request many links that share the same beginning but have different endings. For example, I request "https://stackoverflow.com/questions/ask", "https://stackoverflow.com/tags", and other links that start with "https://stackoverflow.com/...".

I tried two methods to read the source of many pages, but both are unreliable.

METHOD 1:

private static String getUrlSource(String link) throws IOException {
    System.setProperty("java.net.useSystemProxies", "true");
    URL url = new URL(link);
    URLConnection connection = url.openConnection();
    String redirect = connection.getHeaderField("Location");
    if (redirect != null) {
        connection = new URL(redirect).openConnection();
        connection.setRequestProperty("User-Agent",
                "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.4; en-US; rv:1.9.2.2) Gecko/20100316 Firefox/3.6.2");
        connection.setUseCaches(false);
        connection.setReadTimeout(30000);
        connection.setDoOutput(true);
        connection.setRequestProperty("Content-Type", "application/x-java-serialized-object");
    }
    BufferedReader in = new BufferedReader(new InputStreamReader(connection.getInputStream(), "UTF-8"));
    StringBuilder text = new StringBuilder();
    String inputLine;
    System.out.println();
    while ((inputLine = in.readLine()) != null) {
        text.append(inputLine);
        // System.out.println(inputLine);
    }
    in.close();
    return text.toString();
}

METHOD 2:

 String sourceCode = Jsoup.connect(url).timeout(0).userAgent("Mozilla").get().html();

I get the same error with this method, but it appears rarely. Does anyone know why I get this error and how to fix it? Interestingly, when I send fewer links using method 2, the problem often disappears. I also tried turning off my firewall, but it didn't help.
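Since the error becomes rarer when fewer links are sent, one thing worth trying is to retry a failed request after a pause instead of giving up on the first timeout. The following is only a minimal sketch under that assumption, not a confirmed fix: the class `RetryingFetcher`, the interface `PageLoader`, and the method `fetchWithRetry` are hypothetical names introduced here, and the attempt count and delays are arbitrary.

```java
import java.io.IOException;
import java.net.ConnectException;
import java.net.SocketTimeoutException;

public class RetryingFetcher {

    /** Hypothetical interface: anything that loads a page and may throw IOException. */
    @FunctionalInterface
    public interface PageLoader {
        String load(String link) throws IOException;
    }

    /**
     * Calls the loader up to maxAttempts times. After a connect failure or a
     * timeout it sleeps, doubles the delay, and tries again. Other exceptions
     * are not retried. The last timeout is rethrown if all attempts fail.
     */
    public static String fetchWithRetry(PageLoader loader, String link,
                                        int maxAttempts, long initialDelayMillis)
            throws IOException, InterruptedException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long delay = initialDelayMillis;
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return loader.load(link);
            } catch (ConnectException | SocketTimeoutException e) {
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(delay); // back off before the next attempt
                    delay *= 2;
                }
            }
        }
        throw last;
    }
}
```

With method 2 this could be called as `RetryingFetcher.fetchWithRetry(l -> Jsoup.connect(l).timeout(10000).userAgent("Mozilla").get().html(), url, 3, 1000)`. Note that jsoup treats `timeout(0)` as an infinite timeout, so a finite value is used here as an assumption.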

grap
  • "_I have the same error but it rarely appears_" but when it appears, what does it look like (add the stacktrace and tell the line in cause) ? All I can say is that you have a `connectionTimeOut`, but only set a `readTimeout` – AxelH Sep 13 '17 at 10:02
  • It appears randomly. No particular link triggers the error. The error is thrown by this line: `String sourceCode = Jsoup.connect(url).timeout(0).userAgent("Mozilla").get().html();` – grap Sep 13 '17 at 10:06
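To illustrate the point from the first comment: in method 1 only a read timeout is set, and only inside the redirect branch, so the connect phase has no explicit limit. A minimal sketch of setting both timeouts before the connection is opened might look like this. The class `TimeoutConfig`, the method `open`, and the timeout values in the usage line are hypothetical, chosen only for illustration.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

public class TimeoutConfig {

    /**
     * Creates a connection object for an http/https link and configures it.
     * No network traffic happens here: the connection is only opened later,
     * for example when getInputStream() or getHeaderField() is called.
     */
    public static HttpURLConnection open(String link,
                                         int connectTimeoutMillis,
                                         int readTimeoutMillis) throws IOException {
        HttpURLConnection connection = (HttpURLConnection) new URL(link).openConnection();
        // Limit on how long establishing the TCP connection may take.
        connection.setConnectTimeout(connectTimeoutMillis);
        // Limit on how long a single read from the server may block.
        connection.setReadTimeout(readTimeoutMillis);
        connection.setRequestProperty("User-Agent", "Mozilla/5.0");
        return connection;
    }
}
```

For example, `TimeoutConfig.open("https://stackoverflow.com/tags", 10000, 30000)` returns a connection whose settings apply to every request, not only to redirected ones. This makes the timeouts explicit; it does not by itself explain why the server stops accepting connections.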
