I'm trying to use YQL to retrieve HTML from a web page and am running into some trouble.
Big picture: I'm trying to write an application that sends search criteria to a site in the form of a GET request and then extracts the number of results from the response. The site is http://nl.newsbank.com and can be used to search U.S. news articles. The process of searching is fairly simple: you can send a GET request like this one here that runs a search for all articles containing the keyword "pizza" (you can look at the link to see how the query is structured). My application sends this request, then extracts the number of results as shown by the "Results: 1 - n of n" label.
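For reference, the extraction step itself is small. Here is a minimal sketch, assuming the label text has already been pulled out of the page. The name `parseResultCount` is hypothetical, and the comma handling is only a guess at how large counts might be formatted:

```javascript
// Hypothetical helper: pulls the total out of a label such as
// "Results: 1 - 10 of 1234". Returns null when the label is not found.
function parseResultCount(labelText) {
  // Allow optional thousands separators in each number (an assumption).
  var pattern = /Results:\s*\d[\d,]*\s*-\s*\d[\d,]*\s+of\s+(\d[\d,]*)/i;
  var match = pattern.exec(labelText);
  if (!match) {
    return null;
  }
  // Strip any commas before converting the captured total to a number.
  return parseInt(match[1].replace(/,/g, ""), 10);
}
```

The hard part is getting the page text to the client in the first place, which is what the rest of this question is about.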
This is simple on paper and easy to implement server-side. However, because this isn't a 'real' API, I have to load the entire page to get the very small piece of data that I care about. That makes it a lot more bandwidth-intensive than I'd prefer, on top of the other annoying aspects of doing this server-side.
I'm trying to implement similar functionality client-side using YQL, as suggested in this answer. The problem is that while the provided example:
var query = 'SELECT * FROM html WHERE url="http://mattgemmell.com/2008/12/08/what-have-you-tried/" and xpath="//h1" and class="entry-title"';
var url = "http://query.yahooapis.com/v1/public/yql?q=" + query + "&format=json&callback=??";
$.getJSON(url, function(data) {
    alert(data.query.results.h1.content);
});
works perfectly and exactly in the way I would expect it to, I'm not able to do the same thing for the http://nl.newsbank.com searches. I had some trouble using xpaths in the above way.
There are two cases. When I try to run the GET request (which, as you can see, loads fine in a browser), I get the following error in data
no matter what xpath I enter: `Query syntax error(s) [line 1:89 mismatched character ' ' expecting '"']`
Alternatively, when I just try to retrieve HTML from http://nl.newsbank.com, I get null HTML from YQL.
I don't get null HTML when using other types of access (like server-side use of HTMLAgilityPack, or just a browser). As you can see if you try the example in the jsfiddle, YQL works fine for other websites, so I'm utterly mystified as to why it doesn't cooperate for this specific website.
Any help is enormously welcome.
Edit: An example YQL query construction that fails is:
var xpath = "//span[@class='basic-text-white']";
var query = 'SELECT * FROM html WHERE url="http://nl.newsbank.com/nl-search/we/Archives/?s_siteloc=NL2&p_queryname=4000&p_action=search&p_product=NewsLibrary&p_theme=newslibrary2&s_search_type=customized&d_sources=location&d_place=United%20States&p_nbid=&p_field_psudo-sort-0=psudo-sort&f_multi=&p_multi=&p_widesearch=smart&p_sort=YMD_date%3aD&p_maxdocs=200&p_perpage=10&p_text_base-0=SEARCHTERM&p_field_base-0=&p_bool_base-1=AND&p_text_base-1=&p_field_base-1=Section&p_bool_base-2=AND&p_text_base-2=&p_field_base-2=&p_text_YMD_date-0=April_1_2001_to_April_1_2012&p_field_YMD_date-0=YMD_date&p_params_YMD_date-0=date%3aB,E&p_field_YMD_date-3=YMD_date&p_params_YMD_date-3=date%3aB,E&Search.x=18&Search.y=18" and xpath="'+xpath+'"';
var url = "http://query.yahooapis.com/v1/public/yql?q=" + query + "&format=json&callback=??";
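One possible culprit worth ruling out is that the query is concatenated into the request URL without being encoded. If so, the first `&` in the search URL would end the `q` parameter early, which seems to line up with the error position of 1:89. Below is a minimal sketch of the same construction with the query passed through the standard `encodeURIComponent` function. The helper name `buildYqlUrl` is hypothetical, and this only addresses the syntax error; whether it also affects the null HTML case is not something I can confirm.

```javascript
// Hypothetical helper, not part of YQL or jQuery: builds the request URL
// with the YQL statement percent-encoded so that "&" and "=" inside the
// target page URL stay part of the q parameter.
function buildYqlUrl(pageUrl, xpath) {
  var query = 'SELECT * FROM html WHERE url="' + pageUrl +
              '" and xpath="' + xpath + '"';
  return "http://query.yahooapis.com/v1/public/yql?q=" +
         encodeURIComponent(query) +
         "&format=json&callback=??";
}
```

It would be called the same way as before, for example `$.getJSON(buildYqlUrl(searchUrl, xpath), function(data) { ... });`, where `searchUrl` is the long search URL from the failing example.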