
I'm trying to fetch an entire webpage so I can extract some data. I'm using (or trying to use) HtmlUnit.

The result I want is the ENTIRE generated code, as produced from all sources. I don't want the source code; I want a result like the 'Inspect element' window in Chrome. Any ideas? Is this even possible? Should I use another library?

I'm posting sample code that DIDN'T help me.

final WebClient webClient = new WebClient(BrowserVersion.CHROME);
final HtmlPage page = webClient.getPage("https://www.bet365.com");
System.out.println(page.asXml());
Has QUIT--Anony-Mousse
Alator

2 Answers


If you mean to extract all the data from a website's server/database (which is what it sounds like), then it isn't possible, because those files are protected.

If you just want the source code, try this solution: How do you Programmatically Download a Webpage in Java

Ashrant Kohli
  • I want a result like the 'Inspect element' window in Chrome, not the source code. I want the result I see in my browser. – Alator Jun 09 '17 at 09:52
  • Well, actually `.asXml()` is not the source as returned from the server; it is what you would see in the browser. If they don't match, then it is possibly a bug. – Ahmed Ashour Jun 09 '17 at 09:54
  • They don't. Should I report it as a bug? – Alator Jun 09 '17 at 09:58
  • When you say the result you see in the browser, you're referring to the code, right? Or do you want to display the site as it would be rendered from the HTML? – Ashrant Kohli Jun 09 '17 at 10:04
  • For example, I see this in my browser (http://imgur.com/a/xUXa0), but in the code I get from the method I don't see the word Brazil anywhere. – Alator Jun 09 '17 at 10:09
  • Okay, I get it. What you're trying to do is download the dynamically generated content. That content is produced by JavaScript after the page has loaded, whereas downloading the source only gives you the initial HTML. Try this out (it involves using a different browser): https://askubuntu.com/questions/411540/how-to-get-wget-to-download-exact-same-web-page-html-as-browser There is also an approach for Python; if you want, you can dump the Python data to a .txt file and read it in through Java: https://dvenkatsagar.github.io/tutorials/python/2015/10/26/ddlv/ – Ashrant Kohli Jun 09 '17 at 10:29

page.getWebResponse().getContentAsString() returns the content as returned from the server.

page.asXml() returns the XHTML of the page, after JavaScript modifications.

page.save(File) saves the page recursively with dependencies.

You can also extract all sources returned from the web server by intercepting the request/response:

// The wrapper's constructor installs itself as the connection of webClient.
new WebConnectionWrapper(webClient) {

    @Override
    public WebResponse getResponse(WebRequest request) throws IOException {
        WebResponse response = super.getResponse(request);
        if (request.getUrl().toExternalForm().contains("my_url")) {
            String content = response.getContentAsString();

            // change or save content

            WebResponseData data = new WebResponseData(content.getBytes(),
                response.getStatusCode(), response.getStatusMessage(), response.getResponseHeaders());
            response = new WebResponse(data, request, response.getLoadTime());
        }
        return response;
    }
};
Ahmed Ashour
  • Well, with the .asXml() method I'm getting a different result, but still not what I want. I want the code of what I see in the browser. E.g., in the browser I see a certain word, and it doesn't exist in the result I'm getting from the method. – Alator Jun 09 '17 at 09:59
  • Then [wait](http://htmlunit.sourceforge.net/faq.html#AJAXDoesNotWork) a little, or provide details of what is wrong, ideally with a [minimal](http://htmlunit.sourceforge.net/submittingJSBugs.html) test case. – Ahmed Ashour Jun 09 '17 at 10:01
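
One way to act on the "wait a little" advice is to poll the page until the expected text shows up, instead of reading it once right after loading. The following is only a minimal sketch and not part of the HtmlUnit API: `RenderedContentPoller` and `waitForText` are hypothetical names, and the page is represented by a plain `Supplier<String>` (here a fake that "renders" the word on the third look) so the example runs without any library on the classpath.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Hypothetical helper for illustration only; not an HtmlUnit class.
final class RenderedContentPoller {

    /**
     * Takes up to maxAttempts snapshots, pausing pauseMillis between them,
     * and returns true as soon as one snapshot contains the expected text.
     */
    static boolean waitForText(Supplier<String> snapshot, String expected,
                               int maxAttempts, long pauseMillis) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (snapshot.get().contains(expected)) {
                return true;
            }
            if (attempt < maxAttempts) {
                try {
                    // Give background scripts time to run before looking again.
                    Thread.sleep(pauseMillis);
                } catch (InterruptedException e) {
                    // Preserve the interrupt flag and stop waiting.
                    Thread.currentThread().interrupt();
                    return false;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Stand-in for a page whose script fills in the text on the third look.
        AtomicInteger looks = new AtomicInteger();
        Supplier<String> fakePage = () ->
                looks.incrementAndGet() >= 3 ? "<td>Brazil</td>" : "<td></td>";
        System.out.println(waitForText(fakePage, "Brazil", 20, 10));
    }
}
```

With HtmlUnit, the snapshot could plausibly be `page::asXml`, for example `waitForText(page::asXml, "Brazil", 20, 500)`. If a fixed upper bound on waiting is enough, `webClient.waitForBackgroundJavaScript(timeoutMillis)` is the built-in alternative. Whether either is sufficient for a particular site is an assumption to verify, since some pages load data in ways HtmlUnit's JavaScript engine may not handle.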