
Edit: I have apparently solved the problem by forcing the code to retry fetching the HTML. The issue is that, at random, the HTML is not retrieved. To force a retry I added:

                int intento = 0;

                while (document == null) {
                    intento++;
                    System.out.println("Attempt number: " + intento);
                    document = getHtmlDocument(urlPage);
                }

I am experiencing this random issue: sometimes fetching a URL fails, and when the timeout is reached the program execution stops. The code:

public static int getStatusConnectionCode(String url) {

    Response response = null;

    try {
        response = Jsoup.connect(url).userAgent("Mozilla/5.0").timeout(100000).ignoreHttpErrors(true).execute();
    } catch (IOException ex) {
        System.out.println("Exception getting the status code: " + ex.getMessage());
    }
    // If the request threw, response is still null; return -1 instead of hitting a NullPointerException
    return response != null ? response.statusCode() : -1;
}

/**
 * This method returns a Document object with the HTML content of the
 * web page, which I can then parse with the JSoup library's methods
 * @param url
 * @return Document with the HTML
 */
public static Document getHtmlDocument(String url) {

    Document doc = null;

    try {
        doc = Jsoup.connect(url).userAgent("Mozilla/5.0").timeout(100000).get();
    } catch (IOException ex) {
        System.out.println("Excepción al obtener el HTML de la página" + ex.getMessage());
    }

    return doc;

}

Should I use another method or increase the timeout limit? The problem is that a full run takes roughly 10 hours, and sometimes the failure happens at URL number 500, another time at number 250... this makes no sense to me: if there were a real problem with link number 250, why does the failure occur at link number 450 (for example) when I run the program again? I have considered internet problems, but that is not it.
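
While thinking about "another method": right now each URL is hit twice, once in getStatusConnectionCode and once in getHtmlDocument, which doubles the chances of a random failure. A single execute() can provide both the status code and the Document. A sketch of what I mean (the method name fetch is just illustrative):

    public static Document fetch(String url) {

        try {
            Response response = Jsoup.connect(url)
                    .userAgent("Mozilla/5.0")
                    .timeout(100000)
                    .ignoreHttpErrors(true)
                    .execute();
            if (response.statusCode() == 200) {
                return response.parse();      // reuse the body already downloaded by execute()
            }
            System.out.println("Status " + response.statusCode() + " for " + url);
        } catch (IOException ex) {
            System.out.println("Exception fetching " + url + ": " + ex.getMessage());
        }
        return null;                          // caller writes "-" for the prices when this is null
    }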

The solution from a similar question does not solve my problem: Java JSoup error fetching URL

Thanks in advance.

    Might be possible that the specific link is down when you tried to access and up again when you verified from browser. Also possible that it might have blocked your request finding it to appear from a bot. There can be multiple reasons and cannot be certain on why it occurs. Coding advice for you is to skip such erroneous links and proceed with next links in execution. You can re run the code later just for those which failed. – Pavan Kumar Jan 23 '17 at 10:06
  • Have you tried using HttpUrlConnection instead? – Steve Smith Jan 23 '17 at 10:09
  • @PavanKumar It could be that, but it's a pity when it happens. In the code I have written, if the connection code is not "200" it just writes "-" for the prices (as it is parsing prices), and in some cases the URL won't exist. – JetLagFox Jan 23 '17 at 11:20
  • @SteveSmith Where should I use HttpUrlConnection? – JetLagFox Jan 23 '17 at 11:21
  • @JetLagFox HttpUrlConnection would replace both your methods, since it can get the response code and download the response. You would then need to pass the response to Jsoup (see the sketch after this thread). Google "HttpUrlConnection example". – Steve Smith Jan 23 '17 at 11:29
  • @SteveSmith In my case do I need a POST request? I suppose only a GET request is needed. – JetLagFox Jan 23 '17 at 11:37
  • @JetLagFox That all depends on what method the server is expecting or what methods it can handle. – Steve Smith Jan 23 '17 at 11:38
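
Following the HttpUrlConnection suggestion above, here is a minimal sketch of how both methods could be replaced by a single GET whose body is handed to Jsoup (the method name, the timeout values, and the UTF-8 charset are assumptions, not taken from the question):

    import java.io.IOException;
    import java.io.InputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;

    // One GET via HttpURLConnection: read the status code, then hand the body to Jsoup
    public static Document fetchWithHttpUrlConnection(String url) {

        try {
            HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
            conn.setRequestMethod("GET");                 // a plain GET, as discussed above
            conn.setRequestProperty("User-Agent", "Mozilla/5.0");
            conn.setConnectTimeout(10000);                // illustrative timeouts
            conn.setReadTimeout(10000);

            int status = conn.getResponseCode();
            if (status == HttpURLConnection.HTTP_OK) {
                try (InputStream in = conn.getInputStream()) {
                    // Jsoup.parse(InputStream, charsetName, baseUri) builds the Document
                    return Jsoup.parse(in, "UTF-8", url);
                }
            }
            System.out.println("Status " + status + " for " + url);
        } catch (IOException ex) {
            System.out.println("Exception fetching " + url + ": " + ex.getMessage());
        }
        return null;
    }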

0 Answers