18

I use YQL to fetch HTML pages and extract information from them. Since today, every query returns the message "html table is no longer supported. See https://policies.yahoo.com/us/en/yahoo/terms/product-atos/yql/index.htm for YQL Terms of Use"

Example in the console: https://developer.yahoo.com/yql/console/#h=select+*+from+html+where+url%3D%22http%3A%2F%2Fwww.google.de%22

Did Yahoo stop this service? Does anybody know of an announcement from Yahoo? I wonder whether this is simply a bug or whether they really discontinued it.

All documentation is still there (html scraping): https://developer.yahoo.com/yql/guide/yql-select-xpath.html , https://developer.yahoo.com/yql/

A while ago I posted in Yahoo's YQL forum, but it no longer exists (or at least I cannot find it). How can I contact Yahoo to find out whether this service has really been discontinued?

Best regards, hebr3

hebr3
  • Yes, not working for me too. They give us a link to the "YQL Terms of Use" page but it is no help. It seems the YQL service is still operational but as the error message states the "HTML table" query is just not supported any more. So, I'm trying to find another way to scrape an HTML table from a web page. Perhaps there is another YQL service out there that can help extract a table from a web page or there is some alternative query in YQL I can try. I guess I will have to read docs on YQL to find out. – user1467483 Jun 08 '17 at 13:12
  • @user1467483 the error is not due to HTML tables. It's related to the YQL table named "html". Think of YQL like any other query language -- information is stored in table structures. In regards to finding an alternative to YQL, that's not necessary. You just have to find an alternative YQL table. See my answer – blakeo_x Jun 09 '17 at 18:19
  • I'm on GAE using YQL html table JSON output and going to refactor scraping using lxml. For not breaking the interface to existing code, it would be useful to have sample YQL output at hand, especially JSON, which was quite peculiar. The [XML-to-JSON-transformation documentation](https://developer.yahoo.com/yql/guide/xml_to_json.html) is not a full spec (e.g. how did it handle mixed nodes?). Please share samples html vs. json, like [this one](https://stackoverflow.com/a/8763933/591336). – vicmortelmans Jun 12 '17 at 11:58
  • Here's a Python gist that can be useful for refactoring a YQL html query returning JSON, by using the lxml module with XPATH query and converting the output to YQL's JSON format, to avoid breaking the interface to other code: [https://gist.github.com/vicmortelmans/5ee79080249ed5e0a173bc9e6fd426b1](https://gist.github.com/vicmortelmans/5ee79080249ed5e0a173bc9e6fd426b1) – vicmortelmans Jun 18 '17 at 12:57
  • Same issue here. It broke my script, and it took some time to find out that this table is no longer supported. There are other public proxies (https://stackoverflow.com/questions/15005500/loading-cross-domain-endpoint-with-jquery-ajax), but they all have limitations and can be blocked if there are too many requests, unlike Yahoo with its cache. – Sergey Novikov Jun 08 '17 at 13:30

4 Answers

18

It looks like Yahoo did indeed end support for the html table as of 6/8/2017 (according to my error logs). There doesn't appear to be any official announcement yet.

Luckily, there is a YQL community table that can be used in place of the official html table with few changes to your codebase. See the htmlstring table in the YQL Console.

Change your YQL query to reference htmlstring instead of html and include the community environment in your REST query. For example:

// Old code
var site = "http://www.test.com/foo.html";
var yql = "select * from html where url='" + site + "' AND xpath='//div'";
var resturl = "https://query.yahooapis.com/v1/public/yql?q="
    + encodeURIComponent(yql) + "&format=json";

// New code
var site = "http://www.test.com/foo.html";
var yql = "select * from htmlstring where url='" + site + "' AND xpath='//div'";
var resturl = "https://query.yahooapis.com/v1/public/yql?q="
    + encodeURIComponent(yql) + "&format=json"
    + "&env=store%3A%2F%2Fdatatables.org%2Falltableswithkeys";
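
If you build these URLs in more than one place, the change can be wrapped in a small helper so the table name and the env parameter live in one spot. This is only a sketch: `buildYqlUrl`, `YQL_ENDPOINT` and `COMMUNITY_ENV` are names I made up for illustration, and it assumes the endpoint and community environment shown above are still reachable.

```javascript
// Hypothetical helper (not part of any Yahoo API): builds the REST URL
// for an htmlstring query. Endpoint and env value come from the answer above.
var YQL_ENDPOINT = "https://query.yahooapis.com/v1/public/yql";
var COMMUNITY_ENV = "store://datatables.org/alltableswithkeys";

function buildYqlUrl(site, xpath) {
    // Same query shape as the "New code" snippet, with the xpath as a parameter.
    var yql = "select * from htmlstring where url='" + site +
              "' AND xpath='" + xpath + "'";
    return YQL_ENDPOINT +
        "?q=" + encodeURIComponent(yql) +
        "&format=json" +
        "&env=" + encodeURIComponent(COMMUNITY_ENV);
}

var resturl = buildYqlUrl("http://www.test.com/foo.html", "//div");
```

Note that the site and xpath values are pasted into the query unescaped, so a value containing a single quote would break the query.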
Potherca
blakeo_x
  • 3
    Thank you very much for this hint. I use only the public version of YQL, for htmlstring I would have to use one with authentication. In any case I am done with Yahoo YQL - I had now several issues with their stability, availability, etc. (though it is a free service I would need reliability and this doesn't seem to exist). I did now set up my own server and use my own web service to get the html pages I need. – hebr3 Jun 10 '17 at 11:01
  • I'm able to use htmlstring without authentication. I wonder why you aren't. PS, if my answer is suitable, please consider marking it as the accepted answer. – blakeo_x Jun 10 '17 at 16:30
  • @blakeo_x your answer is correct, only thing is that `Yahoo APIs` has to be serve over `https` and no `html` – Rubioli Jun 27 '17 at 07:10
  • @user6589814 I'm able to hit the API over http. Are you receiving an error when you try it? Also, the `html` table is only provided as an example of an old query. My suggested solution is to use `htmlstring` – blakeo_x Jun 27 '17 at 16:13
  • @blakeo_x I wouldn't do that. Reason is if you use your time to build a new api or script and for some reason `http` works. I'm sure they will put it down soon since they were going to stop the whole package. – Rubioli Jun 28 '17 at 07:48
  • I'm missing your meaning. Are you saying you think they'll remove the `htmlstring` table as well? If so, I disagree because `htmlstring` is a community-provided table, not officially from Yahoo. So Yahoo has no duty to devote development time to supporting it, ergo they don't mind if it stays. Or are you saying you think they'll remove `http` access? Again, I disagree. No API that receives and serves publicly available data should require security. That's just overkill. – blakeo_x Jun 28 '17 at 15:20
  • Hey @blakeo_x I'm getting this issue, when I'm using your given code : XMLHttpRequest cannot load http://query.yahooapis.com/v1/public/yql?q=select%20*%20from%20htmlstring%2…3D%27*%27&format=json&env=store%3A%2F%2Fdatatables.org%2Falltableswithkeys. No 'Access-Control-Allow-Origin' header is present on the requested resource. Origin 'http://localhost:8080' is therefore not allowed access. The response had HTTP status code 999. – Menu Jul 10 '17 at 21:58
  • I'd like to mention the existence of the `json` table, for those, like myself, who were using the `html` table to retrieve the JSON content returned by a URL (along with the callback parameter -- JSONP). – nyg Aug 01 '17 at 21:44
  • 2
    The htmlstring table works randomly: sometimes it works, sometimes it fails – Muhammad Faisal Iqbal Aug 23 '17 at 11:14
  • I am experiencing htmlstring working sometimes and not others. Seems to be about 50%/50%. Do we have a service solution that is more dependable? – folktrash Oct 04 '17 at 19:54
  • @nyg I tried "from json" but it failed. What is the name of this table? – thdoan Jun 12 '18 at 22:22
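
Given the intermittent failures reported in the comments above, one possible workaround is to retry the request a few times before giving up. A minimal sketch, assuming the failures are transient; `withRetry` and `fetchOnce` are hypothetical names, and `fetchOnce` stands for whatever function you use to request the htmlstring URL and reject on a bad result.

```javascript
// Hypothetical helper: retries an async operation up to maxAttempts times,
// waiting delayMs between attempts. fetchOnce must return a Promise.
function withRetry(fetchOnce, maxAttempts, delayMs) {
    function attempt(n) {
        // Promise.resolve().then(...) also turns a synchronous throw into a rejection.
        return Promise.resolve().then(fetchOnce).catch(function (err) {
            if (n >= maxAttempts) {
                throw err; // give up after the last attempt
            }
            return new Promise(function (resolve) {
                setTimeout(resolve, delayMs);
            }).then(function () {
                return attempt(n + 1);
            });
        });
    }
    return attempt(1);
}
```

This does not make the service more dependable, it only hides some of the failed calls from the caller.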
0

Thank you very much for your code.

It helped me to create my own script to read those pages which I need. I never programmed PHP before, but with your code and the wisdom of the internet I could change your script to my needs.

PHP

<?php
    header('Access-Control-Allow-Origin: *'); // allow all origins

    // Only URLs starting with this prefix may be fetched (placeholder host).
    // Comparing over the prefix's own length avoids a hard-coded offset;
    // a trailing slash in the prefix would also reject look-alike hosts.
    $allowedPrefix = "https://www.xxxx.yy";
    $url = isset($_GET['url']) ? $_GET['url'] : '';
    if (strncmp($url, $allowedPrefix, strlen($allowedPrefix)) !== 0) {
        echo "Only $allowedPrefix allowed!";
        return;
    }
    $xpathQuery = isset($_GET['xpath']) ? $_GET['xpath'] : '//*';

    // Fetch a page with cURL. Needs stricter security checks; this is only basic.
    function check($target_url) {
        $check = curl_init();
        //curl_setopt($check, CURLOPT_HTTPHEADER, array("REMOTE_ADDR: $ip", "HTTP_X_FORWARDED_FOR: $ip"));
        //curl_setopt($check, CURLOPT_INTERFACE, "xxx.xxx.xxx.xxx");
        curl_setopt($check, CURLOPT_COOKIEJAR, 'cookiemon.txt');
        curl_setopt($check, CURLOPT_COOKIEFILE, 'cookiemon.txt');
        curl_setopt($check, CURLOPT_TIMEOUT, 40); // CURLOPT_TIMEOUT is in seconds
        curl_setopt($check, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($check, CURLOPT_URL, $target_url);
        $agent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
        curl_setopt($check, CURLOPT_USERAGENT, $agent);
        curl_setopt($check, CURLOPT_FOLLOWLOCATION, false);
        $tmp = curl_exec($check);
        curl_close($check);
        return $tmp;
    }

    // get html
    $html = check($url);
    if ($html === false || $html === '') {
        // the request failed or returned nothing
        echo json_encode(array('html' => ''));
        return;
    }
    $dom = new DOMDocument();
    @$dom->loadHTML($html);

    // apply xpath filter
    $xpath = new DOMXPath($dom);
    $elements = $xpath->query($xpathQuery);
    $temp_dom = new DOMDocument();
    if ($elements !== false) { // query() returns false for an invalid expression
        foreach ($elements as $n) {
            $temp_dom->appendChild($temp_dom->importNode($n, true));
        }
    }
    $renderedHtml = $temp_dom->saveHTML();

    // return html in json response
    // json structure:
    // {html: "xxxx"}
    $post_data = array(
        'html' => $renderedHtml
    );
    echo json_encode($post_data);
?>

Javascript

$.ajax({
    url: "url of service",
    dataType: "json",
    data: {
        url: url,
        xpath: "//*"
    },
    type: 'GET',
    success: function(data) {
        // data.html contains the filtered markup
    },
    error: function(jqXHR) {
        // handle the failed request
    }
});
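
If you call the service without jQuery, the request URL and the response handling could look roughly like this. It is a sketch under assumptions: `buildProxyUrl` and `extractHtml` are hypothetical names, and the response is assumed to have the {html: "xxxx"} shape produced by the PHP script above.

```javascript
// Hypothetical helper: builds the GET URL for the PHP service above.
function buildProxyUrl(serviceUrl, targetUrl, xpath) {
    return serviceUrl +
        "?url=" + encodeURIComponent(targetUrl) +
        "&xpath=" + encodeURIComponent(xpath);
}

// Hypothetical helper: pulls the markup out of the JSON response text.
// Returns "" when the html field is missing or not a string.
function extractHtml(responseText) {
    var parsed = JSON.parse(responseText);
    return parsed && typeof parsed.html === "string" ? parsed.html : "";
}
```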
hebr3
  • 2
    This might not be a solution for everyone: with your own proxy, all requests reach the target site from your server, which may be undesirable for some tasks. The beauty of YQL was that you could access cached (sometimes not) versions of pages, and to the target site this looked like desirable search-indexing traffic. To imitate cached versions and reduce requests you have to store data, sometimes quite a lot, and that takes more than a one-screen script. So I don't consider it a general-purpose answer. – Sergey Novikov Jun 12 '17 at 12:27
  • 1
    I agree with SerrNovik. This solution is a shallow alternative to YQL, not a way to make YQL behave as requested. It's worth contributing, but not a suitable answer to the original question. Additionally, many developers use YQL to eliminate CORS from the equation. Your solution only works for documents on the same host. – blakeo_x Jun 12 '17 at 16:24
  • Yes, you are right. I also liked the YQL html table, but YQL stopped the service without any warning (at least I did not receive one), so my service stopped working. From my point of view YQL was no longer reliable and I needed a replacement. – hebr3 Jun 13 '17 at 08:50
0

Even though YQL does not support the html table anymore, I've come to realize that instead of making one network call and parsing out the results it's possible to make several calls. For example, my call before would look like this:

select html from rss where url="http://w1.weather.gov/xml/current_obs/KFLL.rss"

This used to return the information shown below:

[Image: query result]

Now I'd have to use these two:

select title from rss where url="http://w1.weather.gov/xml/current_obs/KFLL.rss"

select description from rss where url="http://w1.weather.gov/xml/current_obs/KFLL.rss"

to get what I want. I don't know why they would deprecate something like this without a clearly listed fallback, but you should be able to get your data this way.
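
The per-field queries can be generated instead of written out by hand. A rough sketch: `buildRssQueryUrls` is a hypothetical name, and the REST endpoint is the public one quoted in the accepted answer, which I am assuming still serves the rss table.

```javascript
// Hypothetical helper: returns one YQL REST URL per requested field
// of the rss table, in the same order as the fields array.
function buildRssQueryUrls(feedUrl, fields) {
    return fields.map(function (field) {
        var yql = "select " + field + ' from rss where url="' + feedUrl + '"';
        return "https://query.yahooapis.com/v1/public/yql?q=" +
            encodeURIComponent(yql) + "&format=json";
    });
}

var urls = buildRssQueryUrls(
    "http://w1.weather.gov/xml/current_obs/KFLL.rss",
    ["title", "description"]
);
```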

BruceWayne
0

I recently built an open-source tool called CloudQuery (source code) that provides functionality similar to YQL. It can turn most websites into an API with a few clicks.

timqian