
I am building a web scraper with Java and JavaFX, and I already have a running JavaFX application.

The scraper follows a procedure similar to this blog post: https://ksah.in/introduction-to-web-scraping-with-java/

However, instead of using a fixed URL, I want to scrape any URL the user enters. For this, I need to handle the error that occurs when the URL is not found, and display "Page not found" in my application console in that case.

Here is the code where I get the URL:

    void search() {
        List<Course> v = scraper.scrape(textfieldURL.getText(), textfieldTerm.getText(), textfieldSubject.getText());
        ...
    }

and then I do:

    try {
        HtmlPage page = client.getPage(baseurl + "/" + term + "/subject/" + sub);
        ...
    } catch (Exception e) {
        System.out.println(e);
    }

in the scraper file.

Nimantha
user13

2 Answers


According to the API documentation, getPage throws a FailingHttpStatusCodeException under this condition:

"if the server returns a failing status code AND the property WebClientOptions.setThrowExceptionOnFailingStatusCode(boolean) is set to true."

You can also get the WebResponse from the Page and call getStatusCode() to get the HTTP status code.
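
As a minimal sketch of this second option, the status code read from the response could be mapped to a console message. The class StatusMessages and its method messageFor are hypothetical names, not part of HtmlUnit, and the HtmlUnit call appears only in a comment so that the snippet compiles on its own.

```java
class StatusMessages {

    // Hypothetical helper: turns an HTTP status code into the text shown in the console.
    static String messageFor(int statusCode) {
        if (statusCode == 404) {
            return "Page not found";
        }
        if (statusCode >= 400) {
            return "Request failed with HTTP status " + statusCode;
        }
        return "OK";
    }

    public static void main(String[] args) {
        // In the scraper this value would come from HtmlUnit, e.g.
        //   int statusCode = page.getWebResponse().getStatusCode();
        // which assumes setThrowExceptionOnFailingStatusCode(false), so that
        // getPage returns the error page instead of throwing.
        int statusCode = 404; // assumed value, for illustration only
        System.out.println(messageFor(statusCode));
    }
}
```

The message strings other than "Page not found" are placeholders; only the 404 case comes from the question.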

Allan
  • Worth mentioning that the second option can be used here only if the throw-exception option is disabled. – devmind Apr 22 '19 at 16:40

The tutorial you added contains the following code:

.....
WebClient client = new WebClient();  
client.getOptions().setCssEnabled(false);  
client.getOptions().setJavaScriptEnabled(false);  
try {  
  String searchUrl = "https://newyork.craigslist.org/search/sss?sort=rel&query=" + URLEncoder.encode(searchQuery, "UTF-8");
  HtmlPage page = client.getPage(searchUrl);
}catch(Exception e){
  e.printStackTrace();
}
.....

With this code, when client.getPage throws any exception, for example because of a 404, it is caught and its stack trace is printed to the console.

You stated that you want to print "Page not found", which means we have to catch a specific exception and log the message. The library used in the tutorial is HtmlUnit (net.sourceforge.htmlunit), and as you can see here (http://htmlunit.sourceforge.net/apidocs/com/gargoylesoftware/htmlunit/WebClient.html#getPage-java.lang.String-) the getPage method throws a FailingHttpStatusCodeException, which contains the status code of the HTTP response. (http://htmlunit.sourceforge.net/apidocs/com/gargoylesoftware/htmlunit/FailingHttpStatusCodeException.html)

This means we have to catch the FailingHttpStatusCodeException and check whether the status code is 404. If it is, log the message; if not, print the stack trace, for example.
For the sake of clean code, try not to catch them all (like in Pokémon) as the tutorial does; instead, use specific catch blocks for the IOException, FailingHttpStatusCodeException and MalformedURLException thrown by the getPage method.
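
The ordering of such catch blocks could look roughly like the sketch below. To keep it compilable without HtmlUnit on the classpath, HttpStatusException is a hypothetical stand-in for FailingHttpStatusCodeException (which is likewise unchecked) and PageLoader is a hypothetical stand-in for the client.getPage call; in the real scraper you would catch the HtmlUnit exception and read its getStatusCode() instead.

```java
import java.io.IOException;
import java.net.MalformedURLException;

class CatchOrderSketch {

    // Hypothetical stand-in for HtmlUnit's FailingHttpStatusCodeException.
    static class HttpStatusException extends RuntimeException {
        final int statusCode;

        HttpStatusException(int statusCode) {
            super("HTTP status " + statusCode);
            this.statusCode = statusCode;
        }
    }

    // Hypothetical stand-in for client.getPage(url).
    interface PageLoader {
        String load(String url) throws IOException;
    }

    // Returns the loaded content, or the message to show in the console.
    static String fetch(PageLoader loader, String url) {
        try {
            return loader.load(url);
        } catch (HttpStatusException e) {
            // The server answered, but with a failing status code.
            return e.statusCode == 404 ? "Page not found" : "HTTP error " + e.statusCode;
        } catch (MalformedURLException e) {
            // Must come before IOException, because it is a subclass of it.
            return "Invalid URL";
        } catch (IOException e) {
            // Any other I/O failure, e.g. the host is unreachable.
            return "Connection problem: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(fetch(url -> { throw new HttpStatusException(404); }, "http://example.com/missing"));
        System.out.println(fetch(url -> { throw new MalformedURLException("no protocol"); }, "not a url"));
    }
}
```

Returning a message string is only one possible design; in the JavaFX application the same branches could write to the console area directly.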

  • Thank you for your prompt reply. I just wanted to ask what you mean by "print the stacktrace"? – user13 Apr 22 '19 at 16:29
  • @user13 you see the e.printStackTrace(); in the sample code from the tutorial? That prints the stacktrace from an exception. Also see: https://stackoverflow.com/questions/2560368/what-is-the-use-of-printstacktrace-method-in-java – KeukenkastjeXYZ Apr 22 '19 at 16:35
  • Thank you for your help! I get what to do now. – user13 Apr 22 '19 at 16:41