Edit: I have apparently solved the problem by forcing the code to retry fetching the HTML. The issue is that, at random, the HTML is not retrieved. To force a retry I added:
int intento = 0;
while (document == null) {
    intento++;
    System.out.println("Attempt number: " + intento);
    document = getHtmlDocument(urlPage);
}
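Note that this loop never terminates if the URL keeps failing. Below is a minimal sketch of a bounded variant; the MAX_ATTEMPTS constant and the pause between attempts are my own additions for illustration, not part of the original code:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class RetryFetch {
    // Hypothetical cap on retries; tune to your needs.
    private static final int MAX_ATTEMPTS = 5;

    public static Document fetchWithRetry(String urlPage) throws InterruptedException {
        Document document = null;
        int intento = 0;
        while (document == null && intento < MAX_ATTEMPTS) {
            intento++;
            System.out.println("Attempt number: " + intento);
            document = getHtmlDocument(urlPage);
            if (document == null) {
                Thread.sleep(2000L * intento); // linear backoff between attempts
            }
        }
        return document; // may still be null after MAX_ATTEMPTS failures
    }

    // Same behavior as the original helper: returns null on IOException.
    private static Document getHtmlDocument(String url) {
        try {
            return Jsoup.connect(url).userAgent("Mozilla/5.0").timeout(100000).get();
        } catch (java.io.IOException ex) {
            System.out.println("Exception while getting the page HTML: " + ex.getMessage());
            return null;
        }
    }
}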
I am experiencing this random issue: sometimes fetching a URL fails, and once the timeout is reached the program execution stops. The code:
public static int getStatusConnectionCode(String url) {
    Response response = null;
    try {
        response = Jsoup.connect(url).userAgent("Mozilla/5.0").timeout(100000).ignoreHttpErrors(true).execute();
    } catch (IOException ex) {
        System.out.println("Exception while getting the status code: " + ex.getMessage());
    }
    // Guard against a NullPointerException: if execute() threw, response is still null,
    // so return a sentinel value instead of dereferencing it.
    return (response != null) ? response.statusCode() : -1;
}
/**
 * This method returns a Document object with the HTML content of the page,
 * which I can then parse with the methods of the JSoup library.
 * @param url
 * @return Document with the HTML
 */
public static Document getHtmlDocument(String url) {
    Document doc = null;
    try {
        doc = Jsoup.connect(url).userAgent("Mozilla/5.0").timeout(100000).get();
    } catch (IOException ex) {
        // A null return here is what drives the retry loop above.
        System.out.println("Exception while getting the page HTML: " + ex.getMessage());
    }
    return doc;
}
Should I use another method or increase the timeout limit? The program takes roughly 10 hours to run, and the failure sometimes happens at URL number 500, other times at number 250... This makes no sense to me: if there were a real problem with link number 250, why does the failure occur at link number 450 (for example) on another run? I have considered that it could be a network problem, but it isn't.
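Failures at different URLs on different runs usually point to transient network or server hiccups rather than a bad link, so instead of one very large timeout, a shorter per-attempt timeout combined with a few retries tends to behave better. A sketch under that assumption (the timeout and retry values are illustrative, not from the original code):

import java.io.IOException;
import java.net.SocketTimeoutException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class ResilientFetch {
    public static Document fetch(String url, int maxRetries) throws IOException, InterruptedException {
        IOException last = null;
        for (int i = 1; i <= maxRetries; i++) {
            try {
                // Shorter per-attempt timeout; total wait is bounded by maxRetries.
                return Jsoup.connect(url).userAgent("Mozilla/5.0").timeout(15000).get();
            } catch (SocketTimeoutException ex) {
                last = ex;               // transient failure: wait and retry
                Thread.sleep(3000L * i); // back off a little more each attempt
            }
        }
        // All attempts timed out; rethrow so the caller can log the URL and move on.
        throw (last != null) ? last : new IOException("maxRetries must be >= 1");
    }
}

This way a single slow response costs at most 15 seconds per attempt instead of stalling the whole run, and a URL that fails every attempt raises an exception you can record rather than looping forever.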
The solution from a similar question does not solve my problem: Java JSoup error fetching URL
Thanks in advance.