I have a crawler that reads the HTML source of many web pages. Sometimes I get the exception: Connection timed out: connect.
I send a lot of links that share the same beginning but have different endings, for example "https://stackoverflow.com/questions/ask", "https://stackoverflow.com/tags", and other links starting with "https://stackoverflow.com/...", and so on.
I have tried two methods to read the source code of many pages, but both are unreliable.
METHOD 1:
private static String getUrlSource(String link) throws IOException {
    System.setProperty("java.net.useSystemProxies", "true");
    URLConnection connection = new URL(link).openConnection();
    // Apply the settings to every connection, not only to the redirected one.
    connection.setRequestProperty("User-Agent",
            "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.4; en-US; rv:1.9.2.2) Gecko/20100316 Firefox/3.6.2");
    connection.setUseCaches(false);
    connection.setReadTimeout(30000);
    // Follow one redirect manually if the server sent a Location header.
    String redirect = connection.getHeaderField("Location");
    if (redirect != null) {
        connection = new URL(redirect).openConnection();
        connection.setRequestProperty("User-Agent",
                "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.4; en-US; rv:1.9.2.2) Gecko/20100316 Firefox/3.6.2");
        connection.setUseCaches(false);
        connection.setReadTimeout(30000);
    }
    BufferedReader in = new BufferedReader(
            new InputStreamReader(connection.getInputStream(), "UTF-8"));
    StringBuilder text = new StringBuilder();
    String inputLine;
    while ((inputLine = in.readLine()) != null) {
        text.append(inputLine);
    }
    in.close();
    return text.toString();
}
METHOD 2:
String sourceCode = Jsoup.connect(url).timeout(0).userAgent("Mozilla").get().html();
With this I get the same error, but it appears rarely. Does anyone know why I get this error and how to fix it? Interestingly, when I send fewer links using method 2, the problem often disappears. I also tried turning off my firewall, but it didn't help.
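As a workaround I am experimenting with wrapping each fetch in a generic retry helper, so that a single timed-out connection does not kill the whole crawl. This is only a sketch; the attempt count and delay below are arbitrary values I picked, not something I have tuned:

```java
import java.util.concurrent.Callable;

public class Retry {
    // Run the task, retrying on failure up to maxAttempts times.
    // Sketch only: maxAttempts and delayMillis are arbitrary placeholders.
    public static <T> T withRetry(Callable<T> task, int maxAttempts, long delayMillis)
            throws Exception {
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return task.call();
            } catch (Exception e) {
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(delayMillis); // simple fixed pause between attempts
                }
            }
        }
        throw last; // every attempt failed; rethrow the last exception
    }

    public static void main(String[] args) throws Exception {
        // Self-contained check: a task that fails twice, then succeeds.
        final int[] calls = {0};
        String result = withRetry(() -> {
            calls[0]++;
            if (calls[0] < 3) {
                throw new java.io.IOException("Connection timed out: connect");
            }
            return "ok";
        }, 5, 10);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

In the crawler I would call it as `withRetry(() -> getUrlSource(link), 3, 1000)`, but I don't know whether retrying is the right fix or just hides the real cause.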