8

HtmlUnit is an awesome Java library that lets you programmatically fill out and submit web forms. I'm currently maintaining a pretty old system written in ASP, and instead of manually filling out one particular web form every month as I'm required to, I'm trying to automate the entire task because I keep forgetting about it. The form retrieves data gathered within a given month. Here's what I've coded so far:

WebClient client = new WebClient();
HtmlPage page = client.getPage("http://urlOfTheWebsite.com/search.aspx");

HtmlForm form = page.getFormByName("aspnetForm");       
HtmlSelect frMonth = form.getSelectByName("ctl00$cphContent$ddlStartMonth");
HtmlSelect frDay = form.getSelectByName("ctl00$cphContent$ddlStartDay");
HtmlSelect frYear = form.getSelectByName("ctl00$cphContent$ddlStartYear");
HtmlSelect toMonth = form.getSelectByName("ctl00$cphContent$ddlEndMonth");
HtmlSelect toDay = form.getSelectByName("ctl00$cphContent$ddlEndDay");
HtmlSelect toYear = form.getSelectByName("ctl00$cphContent$ddlEndYear");
HtmlCheckBoxInput games = form.getInputByName("ctl00$cphContent$chkListLottoGame$0");
HtmlSubmitInput submit = form.getInputByName("ctl00$cphContent$btnSearch");

frMonth.setSelectedAttribute("1", true);
frDay.setSelectedAttribute("1", true);
frYear.setSelectedAttribute("2012", true);
toMonth.setSelectedAttribute("1", true);
toDay.setSelectedAttribute("31", true);
toYear.setSelectedAttribute("2012", true);
games.setChecked(true);
submit.click();
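
Since the form has to be submitted every month, the hard-coded dates could be derived from the current date instead. The following is only a sketch: `MonthRange` is a hypothetical helper that is not part of HtmlUnit, and it assumes the drop-downs use plain numeric option values, as in the snippet above.

```java
import java.time.LocalDate;
import java.time.YearMonth;

/** Hypothetical helper: holds the form values for the month before a given date. */
final class MonthRange {
    final int month;    // 1-12, used for both the start and the end month
    final int year;     // four-digit year
    final int lastDay;  // 28-31, the last day of that month

    private MonthRange(int month, int year, int lastDay) {
        this.month = month;
        this.year = year;
        this.lastDay = lastDay;
    }

    /** Returns the calendar month that precedes the month containing {@code today}. */
    static MonthRange previousMonth(LocalDate today) {
        YearMonth previous = YearMonth.from(today).minusMonths(1);
        return new MonthRange(previous.getMonthValue(), previous.getYear(),
                previous.lengthOfMonth());
    }

    public static void main(String[] args) {
        MonthRange range = previousMonth(LocalDate.now());
        // These values would replace the hard-coded arguments to setSelectedAttribute(...)
        System.out.println("start: " + range.month + "/1/" + range.year);
        System.out.println("end:   " + range.month + "/" + range.lastDay + "/" + range.year);
    }
}
```

With this in place, a call such as `toDay.setSelectedAttribute(String.valueOf(range.lastDay), true)` would pick the correct last day, including February in leap years.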

After the click(), I need to wait for the same web page to finish reloading, because it contains a table that displays the results of my search. Once the page is done loading, I need to save it as an HTML file (much like "Save Page As..." in a browser), because I will scrape the data from it to compute totals. I've already written that part using the Jsoup library.

My questions are:
1. How do I programmatically wait for the web page to finish loading in HtmlUnit?
2. How do I programmatically download the resulting web page as an HTML file?

I've looked into the HtmlUnit docs already and couldn't find a class that'll do what I need.

Matthew Quiros

3 Answers

7

Try one of these calls, both of which take a time in milliseconds:

webClient.waitForBackgroundJavaScript(timeoutMillis) or

webClient.waitForBackgroundJavaScriptStartingBefore(delayMillis)

I think you need to specify the browser version as well. By default it emulates IE. You will get more info from here: HTMLUnit doesn't wait for Javascript
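
If a fixed wait turns out to be unreliable, a common alternative is to poll until some condition holds, for example until the results table shows up. The sketch below uses only the JDK; `Poll.waitUntil` is a hypothetical helper, not an HtmlUnit API, and the condition in `main` is just a stand-in.

```java
import java.util.function.BooleanSupplier;

/** Hypothetical helper (not part of HtmlUnit): polls a condition until it holds or time runs out. */
final class Poll {
    private Poll() {}

    /**
     * Checks {@code condition} every {@code intervalMillis} until it returns true or
     * {@code timeoutMillis} has elapsed. Returns true if the condition was met,
     * false on timeout or if the waiting thread was interrupted.
     */
    static boolean waitUntil(BooleanSupplier condition, long timeoutMillis, long intervalMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            try {
                Thread.sleep(intervalMillis);
            } catch (InterruptedException e) {
                // Preserve the interrupt for the caller and stop waiting.
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }

    public static void main(String[] args) {
        long start = System.currentTimeMillis();
        // Stand-in condition: becomes true after roughly 300 ms.
        boolean met = waitUntil(() -> System.currentTimeMillis() - start >= 300, 2_000, 50);
        System.out.println("condition met: " + met);
    }
}
```

With HtmlUnit, the condition could be something like `() -> page.getElementById("resultsTable") != null`, where the element id is an assumption about the target page.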

UVM
  • I used `waitForBackgroundJavaScript()` instead of forcing my thread to sleep. What do you mean "mention the browser," though--as in when instantiating the `WebClient` object? Also, I forgot to mention that I'm doing this all in Ubuntu, so maybe it's Firefox? – Matthew Quiros Jul 05 '12 at 07:07
  • Could be. But ideally this difference should not be there. – UVM Jul 05 '12 at 07:10
  • @matkiros I think he means you need to try changing the browser by passing `BrowserVersion.FIREFOX_3_6` or any available versions of browsers to the constructor of `WebClient`. – Eng.Fouad Jul 05 '12 at 07:16
  • @Eng.Fouad Right. Well, in any case, it doesn't look like I have to do that anymore. I was already able to get the page's source by leaving the `WebClient` constructor empty. – Matthew Quiros Jul 05 '12 at 07:21
1

This example might help you. After you click, you need to wait for the page to load, since most of the time it's a dynamic page that uses JavaScript. The overridden listener methods are left empty on purpose so that you aren't overwhelmed by console messages; implement whichever ones you want.

// Imports assume an HtmlUnit 2.x release (com.gargoylesoftware packages, org.w3c.css.sac
// CSS error handler). Listener signatures differ between releases, so adjust as needed.
import java.io.IOException;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.logging.Level;

import org.apache.commons.logging.LogFactory;
import org.w3c.css.sac.CSSException;
import org.w3c.css.sac.CSSParseException;
import org.w3c.css.sac.ErrorHandler;

import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.IncorrectnessListener;
import com.gargoylesoftware.htmlunit.ScriptException;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HTMLParserListener;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import com.gargoylesoftware.htmlunit.javascript.JavaScriptErrorListener;

// The class name is arbitrary; it only exists so that the two methods compile.
public class QuietClientExample {

    public static void main(String[] args) throws IOException {
        WebClient webClient = gethtmlUnitClient();
        final HtmlPage page = webClient.getPage("YOUR PAGE");
        // Give background JavaScript up to 60 seconds to finish.
        webClient.waitForBackgroundJavaScript(60000);
        // Print the page markup rather than the HtmlPage object's toString().
        System.out.println(page.asXml());
    }

    // Builds a WebClient whose logging and listeners are all silenced.
    static public WebClient gethtmlUnitClient() {
        WebClient webClient;
        LogFactory.getFactory().setAttribute("org.apache.commons.logging.Log",
                "org.apache.commons.logging.impl.NoOpLog");
        java.util.logging.Logger.getLogger("com.gargoylesoftware.htmlunit").setLevel(Level.OFF);
        java.util.logging.Logger.getLogger("org.apache.commons.httpclient").setLevel(Level.OFF);
        webClient = new WebClient(BrowserVersion.CHROME);
        // Every listener method below is intentionally empty so that nothing is logged.
        webClient.setIncorrectnessListener(new IncorrectnessListener() {
            @Override
            public void notify(String arg0, Object arg1) {
            }
        });
        webClient.setCssErrorHandler(new ErrorHandler() {
            @Override
            public void warning(CSSParseException arg0) throws CSSException {
            }

            @Override
            public void fatalError(CSSParseException arg0) throws CSSException {
            }

            @Override
            public void error(CSSParseException arg0) throws CSSException {
            }
        });
        webClient.setJavaScriptErrorListener(new JavaScriptErrorListener() {
            @Override
            public void timeoutError(HtmlPage arg0, long arg1, long arg2) {
            }

            @Override
            public void scriptException(HtmlPage arg0, ScriptException arg1) {
            }

            @Override
            public void malformedScriptURL(HtmlPage arg0, String arg1, MalformedURLException arg2) {
            }

            @Override
            public void loadScriptError(HtmlPage arg0, URL arg1, Exception arg2) {
            }
        });
        webClient.setHTMLParserListener(new HTMLParserListener() {
            @Override
            public void warning(String arg0, URL arg1, String arg2, int arg3, int arg4, String arg5) {
            }

            @Override
            public void error(String arg0, URL arg1, String arg2, int arg3, int arg4, String arg5) {
            }
        });
        // Do not abort page loading when a script on the page throws.
        webClient.getOptions().setThrowExceptionOnScriptError(false);
        return webClient;
    }
}
Mark
0

How do I programmatically download the resulting web page as an HTML file?

Try asXml(). Something like:

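// click() returns the page that results from submitting the form
page = submit.click();
String htmlContent = page.asXml();
File htmlFile = new File("C:/index.html");
// PrintWriter has no (File, boolean) constructor; pass a charset name instead
PrintWriter pw = new PrintWriter(htmlFile, "UTF-8");
pw.print(htmlContent);
pw.close();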
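
As an alternative sketch for the file-writing step, `java.nio.file.Files` can write the markup in one call with an explicit encoding. `HtmlSaver` is a hypothetical helper name, and the markup in `main` is a placeholder for whatever `asXml()` returns.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper: writes markup to disk as UTF-8. */
final class HtmlSaver {
    private HtmlSaver() {}

    /** Writes {@code html} to {@code target}, replacing any existing file, and returns the path. */
    static Path save(String html, Path target) {
        try {
            // Files.write creates the file or truncates an existing one by default.
            return Files.write(target, html.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path target = Files.createTempFile("result", ".html");
        save("<html><body><p>placeholder</p></body></html>", target);
        System.out.println("wrote " + Files.size(target) + " bytes to " + target);
    }
}
```

With HtmlUnit, this could be called as `HtmlSaver.save(page.asXml(), Paths.get("result.html"))`, where the file name is only an example.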
Eng.Fouad
  • `asXml()` does work! Do you know anything about waiting for the page to reload though? I tried to make the thread sleep for 30 seconds after my call to `click()` and successfully wrote the result of `asXml()` in an HTML file, but while the ` – Matthew Quiros Jul 05 '12 at 07:01
  • @matkiros There is no benefit in making the thread sleep, since `click()` returns immediately with a new instance of `HtmlPage` (or a subclass); i.e., you need to do `page = submit.click();` or assign the result to a new reference. – Eng.Fouad Jul 05 '12 at 07:14
  • You're right, I did the `page = submit.click()` thing, and it also worked as I wanted it to. Thanks! – Matthew Quiros Jul 05 '12 at 07:18