1

I started using HtmlUnit today, so I'm still a beginner.

I've managed to go to IMDb and search for the movie "Sleepers" from 1996, and I get a bunch of results with the same name:

Here are the results from that search

I want to select the first "Sleepers" from the list, which is the correct one, but I don't know how to get that information with HtmlUnit. I looked inside the code and found the link, but I don't know how to extract it.

I guess I could use a regex, but that would defeat the purpose of using HtmlUnit.

This is my code (It has some bits from HtmlUnit's tutorial and some code found here):

public IMdB() {
    try {
        //final WebClient webClient = new WebClient();

        final WebClient webClient = new WebClient(BrowserVersion.INTERNET_EXPLORER_8, "10.255.10.34", 8080);

        //set proxy username and password 
        final DefaultCredentialsProvider credentialsProvider = (DefaultCredentialsProvider) webClient.getCredentialsProvider();
        credentialsProvider.addCredentials("xxxx", "xxxx");

        // Get the first page
        final HtmlPage page1 = webClient.getPage("http://www.imdb.com");

        // Get the form that we are dealing with and within that form, 
        // find the submit button and the field that we want to change.
        //final HtmlForm form = page1.getFormByName("navbar-form");
        HtmlForm form = page1.getFirstByXPath("//form[@id='navbar-form']");

        //
        HtmlButton button = form.getFirstByXPath("/html/body//form//button[@id='navbar-submit-button']");            
        HtmlTextInput textField = form.getFirstByXPath("/html/body//form//input[@id='navbar-query']");

        // Change the value of the text field
        textField.setValueAttribute("Sleepers");

        // Now submit the form by clicking the button and get back the second page.
        HtmlPage page2 = button.click();

       // form = page2.getElementByName("s");

        //page2 = page2.getFirstByXPath("/html/body//form//div//tr[@href]");

        System.out.println("content: " + page2.asText());

        webClient.closeAllWindows();
    } catch (IOException ex) {
        Logger.getLogger(IMdB.class.getName()).log(Level.SEVERE, null, ex);
    }

    System.out.println("END");
}
Mosty Mostacho
Jh62

2 Answers

1

You can get the link of the first result with an XPath query, like this:

HtmlPage htmlPage = new WebClient().getPage("http://imdb.com/blah");
HtmlAnchor anchor = htmlPage.getFirstByXPath("//td[@class='primary_photo']//a");
System.out.println(anchor.getHrefAttribute());
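
To see what that XPath expression selects without hitting the live site, here is a minimal sketch that runs the same query with the JDK's built-in `javax.xml.xpath` instead of HtmlUnit. The markup in `SAMPLE`, the `result_text` class, and the title IDs are made up for illustration and are only an assumption about what the results table looks like; the class name `FirstResultLink` is hypothetical too.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

public class FirstResultLink {
    // Hypothetical, simplified stand-in for the search results table.
    // The real page is not reproduced here; IDs and classes are invented.
    static final String SAMPLE =
          "<table>"
        + "<tr><td class='primary_photo'><a href='/title/tt0000001/'>poster</a></td>"
        + "<td class='result_text'><a href='/title/tt0000001/'>Sleepers</a> (1996)</td></tr>"
        + "<tr><td class='primary_photo'><a href='/title/tt0000002/'>poster</a></td>"
        + "<td class='result_text'><a href='/title/tt0000002/'>Sleepers</a> (1991)</td></tr>"
        + "</table>";

    // Returns the href of the first anchor inside a primary_photo cell,
    // or null when no such anchor exists.
    static String firstHref(String xml) throws XPathExpressionException {
        XPath xpath = XPathFactory.newInstance().newXPath();
        Element anchor = (Element) xpath.evaluate(
                "(//td[@class='primary_photo']//a)[1]",
                new InputSource(new StringReader(xml)),
                XPathConstants.NODE);
        return anchor == null ? null : anchor.getAttribute("href");
    }

    public static void main(String[] args) throws XPathExpressionException {
        System.out.println(firstHref(SAMPLE)); // prints /title/tt0000001/
    }
}
```

The `(...)[1]` wrapper makes "first match in document order" explicit, which is the same result `getFirstByXPath` gives you. The returned href is relative, so it still has to be resolved against the site's base URL before you request it.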
Mosty Mostacho
  • Thanks. I'll try this. I've managed to extract some specific data using regex, but I think HtmlUnit has some tools for this type of thing. – Jh62 Sep 03 '13 at 03:54
  • How would I extract the "Sleepers" part from this: `Sleepers (1996)`? My program is working fine. It finds the movie, cast, nominations, rating, etc., but through regex.
    – Jh62 Sep 03 '13 at 04:06
  • Regex is clearly NOT the way to go. Check this [question](http://stackoverflow.com/questions/1732348) and the first answer. I'd recommend you use XPath. You'll find many tutorials on Google. – Mosty Mostacho Sep 03 '13 at 13:34
  • Thanks, I'll look into that, but I think regex (even though it's not the best way to do it) works well for simple things. I already have my program working and fetching information with regex (until I learn how to use XPath). – Jh62 Sep 06 '13 at 05:29
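
For the "Sleepers (1996)" case raised in the comments, XPath alone can isolate the title when the title sits in its own anchor and the year is plain text next to it. The sketch below assumes exactly that structure; the `result_text` class, the sample markup, and the class name `TitleText` are hypothetical, and it uses the JDK's `javax.xml.xpath` rather than HtmlUnit so it runs on its own.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

public class TitleText {
    // Hypothetical markup: the title is inside the anchor, the year is outside it.
    static final String SAMPLE =
          "<table>"
        + "<tr><td class='result_text'><a href='/title/tt0000001/'>Sleepers</a> (1996)</td></tr>"
        + "<tr><td class='result_text'><a href='/title/tt0000002/'>Sleepers</a> (1991)</td></tr>"
        + "</table>";

    // string(...) converts the first matching anchor to its text content,
    // so the "(1996)" outside the anchor is never part of the result.
    static String firstTitle(String xml) throws XPathExpressionException {
        XPath xpath = XPathFactory.newInstance().newXPath();
        return xpath.evaluate(
                "string((//td[@class='result_text']/a)[1])",
                new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws XPathExpressionException {
        System.out.println(firstTitle(SAMPLE)); // prints Sleepers
    }
}
```

If the real markup puts the title and year in the same text node, this approach no longer separates them, and some string handling after the XPath query would still be needed.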
0

I would suggest using the IMDb API rather than doing all that.

IMDb currently has two public APIs that, although undocumented, are very quick and reliable (they are used on IMDb's own site through AJAX).

  1. A statically cached search suggestions API:

  2. More advanced search

dirtydexter