
I'm having what seems like a really simple problem. I'm trying to navigate to an element in HTML by Xpath, and can't seem to get it to function properly.

I want to grab a span from the html contents of a page. The page is fairly complex, so I've been using Firebug's "get element by xpath" and pasting the result into my code. I've noticed it's slightly different than the xpath you get from doing the same thing in Chrome, but they both seem to direct to the same place.

The html I'm trying to navigate is found here. The field I'm trying to access via xpath is the first "Results 1 - 10 of n".

Based on FireBug's 'inspect element' the xpath should be: /html/body/div/center/table/tbody/tr[6]/td/table/tbody/tr/td[2]/table/tbody/tr/td/table/tbody/tr/td/table/tbody/tr/td/table/tbody/tr/td/span

However, when I try to use this XPath to identify the element in a C# code-behind, I get a number of errors saying that the path cannot be found.

Am I doing something wrong here? I've tried a number of permutations of the xpath and I don't understand why this wouldn't be cooperating within code.

Edit: I'm having this problem both in HTMLAgilityPack (but managed to hack out a bad solution using regexes instead) and a SELECT statement modeled after the answer found here

Edit 2: I'm trying to figure out this issue using Yahoo's free proxy as shown in the example here:

var query = 'SELECT * FROM html WHERE url="http://mattgemmell.com/2008/12/08/what-have-you-tried/" and xpath="//h1" and class="entry-title"';
var url = "http://query.yahooapis.com/v1/public/yql?q=" + query + "&format=json&callback=??";


$.getJSON(url,function(data){
    alert(data.query.results.h1.content);
})

I'm having the same problems with HTML agility pack but I'm more interested in getting this part to work. It works for the provided URL that the answerer gave me (seen above). However when I try to use even simple xpath expressions on the http://nl.newsbank.com url, I get errors that no object has been retrieved every time, no matter how basic the xpath.

Edit 3: I thought I'd elaborate a little more on the big picture of the larger problem I'm trying to solve of which this problem is a critical component in the hopes that maybe it provides a little more insight.

To learn basic ASP.NET development skills from scratch, I decided to make a simple web application based around the news archive search at http://nl.newsbank.com/. In its current iteration, it sends a POST request (although I've now learned you can use a GET request and just append the body to the end of the URL) to send search criteria, as if the user had entered them in the search bar. It then searches the response (using regexes, not XPath, because I couldn't get that working) for the "Results 1 - 10 of n" span, extracts n, and dumps it in a table. It's a cool little tool for looking up news occurrence rates over time.

I wanted to add functionality such that you could enter a date range (say May 2002 - June 2010) and run a frequency search for every month / week in that range. This is very easy to implement conceptually. HOWEVER the problem is, right now all this happens server side, and since there's no API, the HTTP response contains the entire page, and is therefore very bandwidth intensive. Sending dozens of queries at once would swallow absolutely unspeakable amounts of bandwidth and wouldn't be even a little scalable.

As a result I tried rewriting the application to work client-side. However because of the same-origin policy I'm not able to send a request to an external host from the client-side. HOWEVER there is a loophole that I can use a free Yahoo proxy that makes the request and converts it to JSON, and then I can use the JSON exception of the Same-Origin Policy to retrieve that data from the proxy.

Here's where I'm running into these xpath problems specific to http://nl.newsbank.com. I'm not able to retrieve html with any xpath, and I'm not sure why or how I can fix it. When debugging in VS2010, I'll receive the error Microsoft JScript runtime error: Unable to get value of the property 'content': object is null or undefined
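A note on that error: it is what you would see if the proxy found no match and returned an empty result, so that `data.query.results` is null when `.h1.content` is read. Below is a minimal sketch of a more defensive version of the snippet above. The helper names `buildYqlUrl` and `readFirstMatch` are made up for illustration, and the assumptions that "no match" arrives as `results: null` and that multiple matches arrive as an array are exactly that, assumptions, not documented behaviour.

```javascript
// Sketch only: buildYqlUrl and readFirstMatch are hypothetical helper names.

// Build the proxy URL, encoding the query so that quotes, slashes and
// spaces survive being passed as a URL parameter.
function buildYqlUrl(pageUrl, xpath) {
  var query = 'SELECT * FROM html WHERE url="' + pageUrl +
              '" and xpath="' + xpath + '"';
  return "http://query.yahooapis.com/v1/public/yql?q=" +
         encodeURIComponent(query) + "&format=json&callback=??";
}

// Return the first matched element for tagName, or null if nothing matched.
function readFirstMatch(data, tagName) {
  var results = data && data.query && data.query.results;
  // Assumption: an unmatched XPath shows up as results: null.
  if (!results || !results[tagName]) {
    return null;
  }
  var match = results[tagName];
  // Assumption: several matches are delivered as an array.
  return Array.isArray(match) ? match[0] : match;
}
```

With this, the `$.getJSON` callback can test `readFirstMatch(data, "span")` for null and report "no match" instead of throwing.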

UpQuark
  • Page has a number of IFrames. Depending on what technology you are using, it may need a little "push" to deal with the IFrames. What technology are you using? LINQ2XML? HtmlAgilityPack? Selenium? Watin? – Arran Aug 21 '13 at 17:20
  • Both HtmlAgilityPack and a SELECT statement modeled after the linked answer in the edit (which I don't intricately understand the code of but seems to be working in the example) – UpQuark Aug 21 '13 at 17:31
  • Try removing the `tbody` elements in your expression. `tbody` elements are usually added by Firefox or Chrome when they don't appear in the source HTML before `tr`. So something like `/html/body/div/center/table/tr[6]/td/table/tr/td[2]/table/tr/td/table/tr/td/table/tr/td/table/tr/td/span` might work. Or to be safer `/html/body/div/center/table//tr[6]/td/table//tr/td[2]/table//tr/td/table//tr/td/table//tr/td/table//tr/td/span` – paul trmbrth Aug 21 '13 at 22:24
  • No luck. Still unable to find the element. – UpQuark Aug 22 '13 at 20:00
  • About Html Agility Pack: there is potentially a huge difference between the HTML that goes on the wire (the one the library uses in general) and the browser's in-memory DOM, as the second one can be completely created by client-side javascript. Otherwise, IMHO you shouldn't use XPATH like the one you show (this is *not* simple XPATH), this is a recipe for doom. XPATH expressions on HTML should be tailored to the minimum set of discriminating elements. XPATH expressions given by debug tools are hardly usable. You should use things like `//td[@class='blabla']` or `//span[@id='blabla']` instead. – Simon Mourier Aug 23 '13 at 22:01
  • Could it be an issue with the default namespace? Have a look at [this question](http://stackoverflow.com/questions/2524804/how-do-i-use-xpath-with-a-default-namespace-with-no-prefix) to see how it could work. – Martin Höller Aug 30 '13 at 11:39

3 Answers


Your sample HTML page's elements haven't got many classes to select on, but if you're interested in the first <span> element that contains "Results: 1 - 10 of n", you can use an XPath expression that explicitly targets this textual content.

For example:

//table//span[starts-with(., "Results:")]

will select all <span> elements that are inside a <table> and whose text begins with "Results:" (the //table is not strictly necessary in your case, I think, but it might as well narrow the search a little).

You want the first one of these <span>, so you can use this expression:

(//table//span[starts-with(., "Results:")])[1]

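Note the parentheses around the whole previous expression, followed by [1] to select the first of all the <span> elements matching the text. Without the parentheses, [1] would be evaluated relative to each parent and could match the first such <span> under every parent.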

paul trmbrth

As paul t. already mentioned in a comment, the TBODY elements are generated by the browser's rendering engine and are not in the page source. The next problem is that the DIV between the BODY and CENTER does not exist on the page by default. It is added by a JS statement on line 119.

After stripping out the DIV and TBODY elements like

/html/body/center/table/tr[6]/td/table/tr/td[2]/table/tr/td/table/tr/td/table/tr/td/table/tr/td/span

I can successfully select a node with HtmlAgilityPack.
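The TBODY part of that clean-up is mechanical, so it can be sketched as a small helper. This is only an illustration: `stripTbodySteps` is a made-up name, and it deals solely with `tbody` steps. Elements injected by script, like the DIV here, still have to be found by reading the raw page source.

```javascript
// Hypothetical helper: drop every "tbody" step from a browser-copied XPath.
function stripTbodySteps(xpath) {
  return xpath
    .split("/")
    .filter(function (step) {
      // Remove plain "tbody" steps and indexed ones such as "tbody[1]".
      return !/^tbody(\[\d+\])?$/.test(step);
    })
    .join("/");
}
```

For example, `stripTbodySteps("/html/body/center/table/tbody/tr[6]/td/span")` returns `/html/body/center/table/tr[6]/td/span`.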

Edit: don't use tools like Firebug to get an XPath value for a website. Don't even use them if you just want to look at the source of the page. The problem with Firebug is that it shows you the current DOM document tree, which on almost every page has probably already been (heavily) modified by JS.

Karganium
  • Having some ambiguous problems with selecting the node which I'll continue to look at but it's actually getting content back! Thanks much, have a bounty with compliments. – UpQuark Aug 30 '13 at 16:30
  • Nevermind, it still seems hosed for the first reason. I'll look at it in more detail. – UpQuark Aug 30 '13 at 17:57

It may sound kind of simplistic, but the element you are looking for is the only doc element that is using the css class "basic-text-white". I would think this would be a lot easier to find and extract than a long xpath. Web-scraping is never a stable thing, but I would think this is probably as stable as the xpath. Trying to debug the xpath just about makes my eyes bleed.
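As a rough sketch of that idea, the count can be pulled out of the raw response text without any HTML or XML parsing. The function name `extractResultCount`, the 300-character window, and the sample markup in the usage note are all assumptions for illustration; the real page markup may differ.

```javascript
// Hypothetical helper: pull n out of "Results 1 - 10 of n" in raw HTML text.
function extractResultCount(rawHtml) {
  // Locate the first use of the CSS class named above.
  var start = rawHtml.indexOf("basic-text-white");
  if (start === -1) {
    return null; // class not present in this response
  }
  // Only inspect a short stretch after the class name
  // (300 characters is an arbitrary guess, not a measured value).
  var snippet = rawHtml.slice(start, start + 300);
  // Capture the number after "of", allowing thousands separators.
  var match = /of\s+([\d,]+)/.exec(snippet);
  return match ? parseInt(match[1].replace(/,/g, ""), 10) : null;
}
```

Given the made-up input `<span class="basic-text-white">Results: 1 - 10 of 1,234</span>`, this returns the number 1234. The element appears twice on the page with the same content, so taking the first occurrence is enough.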

Gary Walker
  • I thought of that. Other than the fact that the element appears twice (with the same content though, so that shouldn't be hard to work around) I can't get this to work with the Nl.newsbank link, even though it seems to in other examples. Take a look at my latest edit. – UpQuark Aug 28 '13 at 14:54
  • I grabbed the content directly as it came off the wire (i.e. not as the browser interpreted it, just the raw HTTP result). Treat the raw text as just a stream of bytes and chop it up by hand based on "basic-text-white"; there's no reason to go through XML parsing at all. Seemed to be quite easy that way. – Gary Walker Aug 29 '13 at 23:08