Get the HTML page using htmlunit

Question

I am trying to get the HTML page of a website (e.g. http://htmlunit.sourceforge.net), but I get an IllegalArgumentException: Cannot locate declared field class org.apache.http.impl.client.HttpClientBuilder.dnsResolver. My code is as follows:

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class Main1 {
    public static void main(String[] args) {
        try {
            homePage();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    public static void homePage() throws Exception {
        // WebClient is AutoCloseable, so try-with-resources closes it
        try (final WebClient webClient = new WebClient()) {
            final HtmlPage page = webClient.getPage("http://www.google.com");
            // asText() returns the rendered text, not the markup
            String text = page.asText();
            System.out.println(text);
        }
    }
}

Is there something wrong with the code? Thanks

@Tugrul yeah I need to parse it actually, I am reading that htmlunit can parse the html and javascript elements of a page. — Ihsan Haikal, Aug 05 '16 at 14:01
It seems all right, but it would be better to post the stack trace so we can track what's going on. Maybe you did not set the browser version or the webClient options, and that is why the error occurs. — PSo, Aug 08 '16 at 10:19

score 0 · Answer 1 · answered May 25 '19 at 02:05

It's counter-intuitive, but you can call asXml() on an HtmlPage or HtmlElement to get its HTML/XML representation.

page.asXml()

The way you wrote the code, asText() returns a text representation of what would be shown to a user in a browser.

You may also need to add this to enable JavaScript:

webClient.getOptions().setJavaScriptEnabled(true);

score 0 · Answer 2 · answered May 27 '19 at 14:07

IllegalArgumentException: Cannot locate declared field class org.apache.http.impl.client.HttpClientBuilder.dnsResolver

This looks like a wrong version of the HttpClient dependency. Please check your classpath to have only one (and only the correct) version of every dependency.

For the current version you can find a list of dependencies here: http://htmlunit.sourceforge.net/dependencies.html
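
To check the classpath point above, one option is to ask the JVM which jar a class was actually loaded from and whether that class declares the field named in the error. The following is only a minimal sketch using plain JDK reflection. ClasspathProbe and its methods are hypothetical helper names made up for illustration; they are not part of HtmlUnit or HttpClient. Run main with the same classpath as your application.

```java
import java.lang.reflect.Field;
import java.security.CodeSource;
import java.security.ProtectionDomain;

/** Hypothetical diagnostic helper; not part of HtmlUnit or HttpClient. */
final class ClasspathProbe {

    private ClasspathProbe() {}

    /** Returns the jar or directory a class was loaded from, or a marker for JDK classes. */
    static String locationOf(Class<?> type) {
        ProtectionDomain domain = type.getProtectionDomain();
        CodeSource source = (domain == null) ? null : domain.getCodeSource();
        if (source == null || source.getLocation() == null) {
            return "(bootstrap/JDK)";
        }
        return source.getLocation().toString();
    }

    /** True if the class itself declares a field with this name, regardless of visibility. */
    static boolean declaresField(Class<?> type, String fieldName) {
        for (Field field : type.getDeclaredFields()) {
            if (field.getName().equals(fieldName)) {
                return true;
            }
        }
        return false;
    }

    /** Builds a one-line report for the given class name and field name. */
    static String describe(String className, String fieldName) {
        try {
            // initialize=false: load the class without running its static initializers
            Class<?> type = Class.forName(className, false, ClasspathProbe.class.getClassLoader());
            return className + " loaded from " + locationOf(type)
                    + "; declares '" + fieldName + "': " + declaresField(type, fieldName);
        } catch (ClassNotFoundException | LinkageError e) {
            return className + " is not on the classpath (" + e + ")";
        }
    }

    public static void main(String[] args) {
        // Class and field names are taken from the error message in the question
        System.out.println(describe("org.apache.http.impl.client.HttpClientBuilder", "dnsResolver"));
    }
}
```

If the report shows an unexpected jar, or says the field is not declared, that would be consistent with a mismatched HttpClient version on the classpath.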

score -2 · Answer 3 · answered Aug 05 '16 at 14:05

You can use the jsoup parser.

A small code sample

Document doc = Jsoup.connect("http://en.wikipedia.org/").get();
Elements newsHeadlines = doc.select("#mp-itn b a");

Advanced Usage

File input = new File("/tmp/input.html");
Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");

Element content = doc.getElementById("content");
Elements links = content.getElementsByTag("a");
for (Element link : links) {
  String linkHref = link.attr("href");
  String linkText = link.text();
}

Tugrul

Jsoup is not able to parse JavaScript elements, right? What I actually need is something that can get HTML that contains JavaScript, which is why I am trying htmlunit – Ihsan Haikal Aug 05 '16 at 14:09
If you only need the JavaScript, just use any web scraper application to fetch the *.js files and save them to local storage. Then parse them however you want. – Tugrul Aug 05 '16 at 14:12
I need to parse the real, current page. Unfortunately the page I want is a single-page application that fetches the required elements later via JavaScript. If I use Jsoup, it will only get the background page, not the current elements that I want. – Ihsan Haikal Aug 05 '16 at 14:18
Ok. Forget about jsoup. – Tugrul Aug 05 '16 at 14:19
Do you know how to get the XML elements of the JavaScript and HTML of the page? It seems htmlunit is a no-go as well – Ihsan Haikal Aug 05 '16 at 14:26
What do you mean by XML elements? Do you mean HTML tags, etc.? – Tugrul Aug 05 '16 at 14:29
Yeah, basically the HTML tags along with the portion of the page that will be fetched later. Jsoup is only able to get the basic page, without the portion of the page that is fetched later. – Ihsan Haikal Aug 05 '16 at 14:32
@IhsanHaikal See [this related post](https://stackoverflow.com/q/50189638/8583692). – Mahozad Nov 15 '21 at 13:30
