
I am writing a crawler with Jsoup and this is the HTTP error I get:

org.jsoup.HttpStatusException: HTTP error fetching URL. Status=404, URL=https://www.mkyong.com/spring-boot/spring-boot-hibernate-search-example/%E2%80%9Chttp:/wildfly.org/downloads/
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:760)
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:757)
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:706)
at org.jsoup.helper.HttpConnection.execute(HttpConnection.java:299)
at org.jsoup.helper.HttpConnection.get(HttpConnection.java:288)
at testing.DefinitelyNotSpiderLeg.crawl(DefinitelyNotSpiderLeg.java:31)
at testing.DefinitelyNotSpider.search(DefinitelyNotSpider.java:33)
at testing.Test.main(Test.java:9)

I have read the other similar questions and answers about this error and implemented their suggestions in my code, but I still get the same error when Jsoup connects to the URL.

This is the method I use for crawling:

public boolean crawl(String url)
{
    try
    {
        Document htmlDocument = Jsoup.connect(url)
                .userAgent("Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/13.0.782.112 Safari/535.1")
                .referrer("http://www.google.com")
                .timeout(1000 * 5) // in milliseconds, so 5 seconds
                .get();

        Elements linksOnPage = htmlDocument.select("a[href]");

        for (Element link : linksOnPage)
        {
            String a = link.attr("abs:href");

            if (a.startsWith(url)) {
                this.links.add(a);
            }
        }

    } catch (NullPointerException e) {
        e.printStackTrace();
    } catch (HttpStatusException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    }
    return true;

}

Any ideas guys???

  • I see the URL in the exception is https://www.mkyong.com/spring-boot/spring-boot-hibernate-search-example/%E2%80%9Chttp:/wildfly.org/downloads/. Is this the URL being passed? – user3134614 Apr 02 '18 at 11:26
  • Well, exception says it clearly, server can't find resource for `https://www.mkyong.com/spring-boot/spring-boot-hibernate-search-example/%E2%80%9Chttp:/wildfly.org/downloads/`. How are you using this method that you end up with such call? – Pshemo Apr 02 '18 at 11:26
  • Since its a https connection are you taking care of ssl. https://stackoverflow.com/questions/7744075/how-to-connect-via-https-using-jsoup?utm_medium=organic&utm_source=google_rich_qa&utm_campaign=google_rich_qa – utkarsh dubey Apr 02 '18 at 11:38
  • I collect all the URLs from a webpage, https://www.mkyong.com, and then I "crawl" each URL I collected. I guess that one of these links is "https://www.mkyong.com/spring-boot/spring-boot-hibernate-search-example/%E2%80%9Chttp:/wildfly.org/downloads/". – Anna Noukou Apr 02 '18 at 11:38

2 Answers


It is because the URL is incorrect.

In your code you are fetching the URL https://www.mkyong.com/spring-boot/spring-boot-hibernate-search-example/%E2%80%9Chttp:/wildfly.org/downloads/

As the first line of the stack trace shows:

org.jsoup.HttpStatusException: HTTP error fetching URL. Status=404, URL=https://www.mkyong.com/spring-boot/spring-boot-hibernate-search-example/%E2%80%9Chttp:/wildfly.org/downloads/

the server answers 404 because that resource simply does not exist.
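
To see at a glance why that URL is broken: the %E2%80%9C at the start of the last path segment is the UTF-8 percent-encoding of a curly opening quote (“). A quick check with the plain JDK (just an illustration, not part of your crawler) makes the stray character visible:

import java.net.URLDecoder;

public class DecodeCheck {
    public static void main(String[] args) throws Exception {
        String url = "https://www.mkyong.com/spring-boot/spring-boot-hibernate-search-example/%E2%80%9Chttp:/wildfly.org/downloads/";
        // %E2%80%9C decodes to the left curly quote (U+201C) that leaked in
        // from the page markup, so the decoded URL starts with .../“http:/...
        System.out.println(URLDecoder.decode(url, "UTF-8"));
    }
}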

kakabali

The problem is not your code; the problem is one of the links present in the webpage you are parsing.

Here is the page your crawler starts from, and as it crawls it collects all the links on it: https://www.mkyong.com/spring-boot/spring-boot-hibernate-search-example/

If you examine that page carefully, you will find a hyperlink to the WildFly download page, and the markup of that hyperlink is

<a href="“http://wildfly.org/downloads/“" target="“_blank”">official website</a>

Notice the extra curly quotes inside the href attribute. Because of them the href is not a valid absolute URL, so it is resolved as a relative path against the base URL, and the two get appended together. The result is

https://www.mkyong.com/spring-boot/spring-boot-hibernate-search-example/%E2%80%9Chttp:/wildfly.org/downloads/

which is exactly the URL Jsoup reports in the exception. So to resolve your issue while crawling the page, either pre-process the collected links and separate the intended URL http://wildfly.org/downloads/ from the mangled one, or handle the failure in your code and skip such links. Hope that gives you a better idea.
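
For the "handle the failure" option, one possible approach (a rough sketch; the LinkFilter class and isProbablyValid method are made-up names, not part of Jsoup or your code) is to validate each absolute link before adding it to your queue:

import java.net.URI;
import java.net.URISyntaxException;

public class LinkFilter {

    // Rejects hrefs that still carry the curly quotes (raw or percent-encoded)
    // that leaked in from the broken markup, plus anything java.net.URI
    // refuses to parse.
    public static boolean isProbablyValid(String candidate) {
        if (candidate.contains("\u201C") || candidate.contains("\u201D")
                || candidate.contains("%E2%80%9C") || candidate.contains("%E2%80%9D")) {
            return false;
        }
        try {
            new URI(candidate); // throws URISyntaxException on malformed URLs
            return true;
        } catch (URISyntaxException e) {
            return false;
        }
    }
}

In your crawl loop that would become if (a.startsWith(url) && LinkFilter.isProbablyValid(a)) { this.links.add(a); }, and since you already catch HttpStatusException per page, a 404 on one bad link will not stop the rest of the crawl.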

Rishal