HtmlUnit getByXpath returns null

Question

I am coding in Groovy; however, I don't believe these are language-specific questions.

I actually have two questions.

First Question

I've run into an issue while using HtmlUnit. It is telling me that what I am trying to grab is null.

The page I'm testing it on is: http://browse.deviantart.com/resources/applications/psbrushes/?order=9&offset=0#/dbwam4

My code:

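import com.gargoylesoftware.htmlunit.BrowserVersion
import com.gargoylesoftware.htmlunit.WebClient

client = new WebClient(BrowserVersion.FIREFOX_3)
client.javaScriptEnabled = false

// url is the page address given above
page = client.getPage(url)

// comes back as an empty list rather than null
title = page.getByXPath("//html/body/div[4]/div/div[3]/div/div/div/div/div/div/div/div/div/div/h1/a")

println title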

This simply prints out: []

Is this because the page uses onclick()? If so, how would I get around that? Enabling JavaScript creates a mess in my command prompt.

Second Question

I also want to get the image, but I am having trouble because when I copy its XPath (via Firebug) it shows up as: //*[@id="gmi-ResViewSizer_img"]

How do I handle that?

Mads Hansen · Accepted Answer · 2010-12-01T02:38:51.373

1

First Answer:

/html/body/div[3]/div/div[3]/div/div/div/div/div/div/div/div/div/div/h1/a

Your XPath was off by one in the positional predicate on the body's child div: it selects the 4th div, but it should be the 3rd. It appears the site's HTML can change, and has changed since you originally grabbed the XPath with Firebug. You may want to adjust your XPath so that it tolerates such changes and is less sensitive to differences in document structure.

Maybe something like this:

/html/body//div/h1/a
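To illustrate why the relaxed path holds up better, here is a minimal sketch using only the JDK's built-in XPath engine rather than HtmlUnit. The two markup strings are hypothetical stand-ins (not the real page): the second one simply has one fewer div ahead of the target, which is the kind of shift described above.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XPathBrittleness {
    // Hypothetical markup: the target link sits inside the 4th div of body.
    static final String BEFORE =
        "<html><body><div/><div/><div/><div><div><h1><a>Title</a></h1></div></div></body></html>";
    // Same markup with one leading div removed, so the target is now in the 3rd div.
    static final String AFTER =
        "<html><body><div/><div/><div><div><h1><a>Title</a></h1></div></div></body></html>";

    // Returns how many nodes the expression selects in the given document.
    static int count(String xml, String expr) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(
                expr, new InputSource(new StringReader(xml)), XPathConstants.NODESET);
            return nodes.getLength();
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Bad XPath or document: " + expr, e);
        }
    }

    public static void main(String[] args) {
        String positional = "/html/body/div[4]/div/h1/a"; // depends on the exact div index
        String relaxed = "/html/body//div/h1/a";          // any div under body with an h1/a

        System.out.println(count(BEFORE, positional)); // 1
        System.out.println(count(AFTER, positional));  // 0, the 4th div no longer exists
        System.out.println(count(BEFORE, relaxed));    // 1
        System.out.println(count(AFTER, relaxed));     // 1
    }
}
```

The trade-off is that the relaxed path may match more than one element on a real page, so it is worth checking how many nodes come back.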

Second Answer: The XPath that you listed will work. It may look odd or short (and may not be the most efficient), but // starts at the root node and searches every node in the tree, * matches any element (including the img), and the [] predicate filter restricts the match to elements that have an id attribute whose value equals "gmi-ResViewSizer_img".
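As a rough illustration of how that expression is evaluated, the sketch below runs it with the JDK's XPath engine against a small hypothetical document. Only the id value comes from the question; the surrounding markup and the src values are made up for the example.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XPathById {
    // Hypothetical page fragment; only the id is taken from the question.
    static final String PAGE =
        "<html><body><div id='wrapper'><p>text</p>"
        + "<img id='gmi-ResViewSizer_img' src='full.jpg'/>"
        + "<img id='thumb' src='thumb.jpg'/>"
        + "</div></body></html>";

    // Returns the src attribute of the first element matched, or null if nothing matches.
    static String srcOfFirstMatch(String xml, String expr) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(
                expr, new InputSource(new StringReader(xml)), XPathConstants.NODESET);
            if (nodes.getLength() == 0) {
                return null;
            }
            return ((Element) nodes.item(0)).getAttribute("src");
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Bad XPath or document: " + expr, e);
        }
    }

    public static void main(String[] args) {
        // "//" searches the whole tree, "*" accepts any element name,
        // and the predicate keeps only the element with the matching id.
        System.out.println(srcOfFirstMatch(PAGE, "//*[@id='gmi-ResViewSizer_img']")); // full.jpg
        // Naming the element narrows the search but selects the same node here.
        System.out.println(srcOfFirstMatch(PAGE, "//img[@id='gmi-ResViewSizer_img']")); // full.jpg
    }
}
```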

Many other XPaths could work as well; which one is best also depends on how often the HTML structure changes. This one also selects that img on the referenced page:

/html/body/div/div/div/div/img[1]

edited Dec 01 '10 at 02:38

answered Dec 01 '10 at 02:26

Mads Hansen


Thanks again for the explanation Mads Hansen :) You've been quite helpful. The explanation helps; however, for the first answer I still seem to get an empty result. I think it's having problems with the H1 – StartingGroovy Dec 02 '10 at 20:41
Neither of them works; actually, none of the three works. I was trying to look through it yesterday with my script (one by one) but didn't quite seem to find the issue. – StartingGroovy Dec 03 '10 at 20:29
It almost seems as if the script is loading the page before it (i.e. the same link minus the /dbwam4). Do you know if there would be any reason it would pull info from the prior page instead of the one specified? – StartingGroovy Dec 04 '10 at 20:55
Good catch. The hash symbol `#` in a URL introduces a fragment identifier (http://en.wikipedia.org/wiki/Fragment_identifier), which identifies a location within the page (like a named anchor). It is likely that HtmlUnit is "throwing away" "#/dbwam4" and loading the URL without it, whereas browsers do not. – Mads Hansen Dec 05 '10 at 01:48
Is there any way around this issue? I have the option of generating the URLs I want based on the original URLs, but I would prefer not to. – StartingGroovy Dec 09 '10 at 22:50
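To make the fragment discussion in these comments concrete, here is a small JDK-only sketch that splits the question's URL into its fragment and the part that is actually requested. A fragment is not sent to the server in the HTTP request, so any use of "#/dbwam4" has to happen on the client side. That the page relies on JavaScript to act on the fragment is an assumption, not something confirmed in this thread; the class and method names are illustrative.

```java
import java.net.URI;

public class FragmentDemo {
    // The URL from the question.
    static final String URL =
        "http://browse.deviantart.com/resources/applications/psbrushes/?order=9&offset=0#/dbwam4";

    // Returns the fragment identifier (the text after '#'), or null if there is none.
    static String fragmentOf(String url) {
        return URI.create(url).getFragment();
    }

    // Returns the URL with the fragment removed, i.e. the part a client asks the server for.
    static String withoutFragment(String url) {
        int hash = url.indexOf('#');
        return hash < 0 ? url : url.substring(0, hash);
    }

    public static void main(String[] args) {
        System.out.println(fragmentOf(URL));      // /dbwam4
        System.out.println(withoutFragment(URL)); // the listing page URL, without "#/dbwam4"
    }
}
```

If that assumption holds, a client with JavaScript disabled would receive the listing page and nothing would act on the fragment, which would match the behaviour described in the comments.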

score 0 · Answer 2 · edited Jan 03 '11 at 16:24


I had the same problem. I solved it when I noticed iframe tags on the page. Try calling:

((HtmlPage) current_page.getFrames()[n].getEnclosedPage()).getByXPath(...

where n is the index of the frame in the page's frame collection. It worked for me!

Thanks a lot.


answered Jan 02 '11 at 21:33

metootoo


Your problem dealt with fragment identifiers? – StartingGroovy Jan 03 '11 at 19:07
