I'm having what seems like a really simple problem. I'm trying to navigate to an element in HTML by Xpath, and can't seem to get it to function properly.
I want to grab a span from the html contents of a page. The page is fairly complex, so I've been using Firebug's "get element by xpath" and pasting the result into my code. I've noticed it's slightly different than the xpath you get from doing the same thing in Chrome, but they both seem to direct to the same place.
The html I'm trying to navigate is found here. The field I'm trying to access via xpath is the first "Results 1 - 10 of n".
Based on FireBug's 'inspect element' the xpath should be: /html/body/div/center/table/tbody/tr[6]/td/table/tbody/tr/td[2]/table/tbody/tr/td/table/tbody/tr/td/table/tbody/tr/td/table/tbody/tr/td/span
However when I try to use this xpath to identify the element in a C# codebehind, it gives me a number of errors that that path cannot be found.
Am I doing something wrong here? I've tried a number of permutations of the xpath and I don't understand why this wouldn't be cooperating within code.
Edit: I'm having this problem both in HTMLAgilityPack (but managed to hack out a bad solution using regexes instead) and a SELECT statement modeled after the answer found here
Edit 2: I'm trying to figure out this issue using Yahoo's free proxy as shown in the example here:
var query = 'SELECT * FROM html WHERE url="http://mattgemmell.com/2008/12/08/what-have-you-tried/" and xpath="//h1" and class="entry-title"';
var url = "http://query.yahooapis.com/v1/public/yql?q=" + query + "&format=json&callback=??";
$.getJSON(url,function(data){
alert(data.query.results.h1.content);
})
I'm having the same problems with HTML agility pack but I'm more interested in getting this part to work. It works for the provided URL that the answerer gave me (seen above). However when I try to use even simple xpath expressions on the http://nl.newsbank.com url, I get errors that no object has been retrieved every time, no matter how basic the xpath.
Edit 3: I thought I'd elaborate a little more on the big picture of the larger problem I'm trying to solve of which this problem is a critical component in the hopes that maybe it provides a little more insight.
To learn basic ASP.NET development skills from scratch, I decided to make a simple web application, based around the news archive search at http://nl.newsbank.com/. In its current iteration, it sends a POST request (although I've now learned you can use a GET request and just dump the body at the end of the URL) to send search criteria, as if the user entered criteria in the search bar. It then searches the response (using RegExes, not Xpath because I couldnt get that working) for the "Results 1-n of n" span, extracts n, and dumps it in a table. It's a cool little tool for looking up news occurrence rates over time.
I wanted to add functionality such that you could enter a date range (say May 2002 - June 2010) and run a frequency search for every month / week in that range. This is very easy to implement conceptually. HOWEVER the problem is, right now all this happens server side, and since there's no API, the HTTP response contains the entire page, and is therefore very bandwidth intensive. Sending dozens of queries at once would swallow absolutely unspeakable amounts of bandwidth and wouldn't be even a little scalable.
As a result I tried rewriting the application to work client-side. However because of the same-origin policy I'm not able to send a request to an external host from the client-side. HOWEVER there is a loophole that I can use a free Yahoo proxy that makes the request and converts it to JSON, and then I can use the JSON exception of the Same-Origin Policy to retrieve that data from the proxy.
Here's where I'm running into these xpath problems specific to http://nl.newsbank.com. I'm not able to retrieve html with any xpath, and I'm not sure why or how I can fix it. When debugging in VS2010, I'll receive the error Microsoft JScript runtime error: Unable to get value of the property 'content': object is null or undefined