
In my application I used Yahoo's YQL API to extract HTML from other websites, but Yahoo has discontinued the `html` table, so extracting HTML that way no longer works. The API now responds with:

{
 "query": {
  "count": 0,
  "created": "2017-06-26T12:57:49Z",
  "lang": "en-US",
  "meta": {
   "message": "html table is no longer supported. See https://policies.yahoo.com/us/en/yahoo/terms/product-atos/yql/index.htm for YQL Terms of Use"
  },
  "results": null
 }
}

It can be read here.

This is how I have done it so far:

$(function () {
    var fileFieldId;
    var fileFieldClass;
    var fileFieldVal;
    var query;
    var apiUrl;
    $(".data-from-url").keyup(function () {
        fileFieldId = $(this).attr('id');
        fileFieldClass = $(this).attr('class');
        fileFieldVal = $(this).val();
        // YQL statement against the html table (the table that is no longer supported)
        query = 'select * from html where url="' + fileFieldVal + '" and xpath="*"';
        apiUrl = 'https://query.yahooapis.com/v1/public/yql?q=' + encodeURIComponent(query);

        $.get(apiUrl, function (data) {
            var html = $(data).find('html');
            // Fill the form fields that belong to this URL input
            $("input.post[data-title='" + fileFieldId + "']").val(html.find("meta[property='og:title']").attr('content') || 'no title found');
            $("textarea.post-description[data-description='" + fileFieldId + "']").val(html.find("meta[property='og:description']").attr('content') || 'no description found');
            $("input.post-remote-image[data-img='" + fileFieldId + "']").val(html.find("meta[property='og:image']").attr('content') || '');
        });
    });
});

Here is a jsfiddle for the call I am making:

$(function () {
    var query;
    var apiUrl;
    $("button.click").click(function () {
        //query = 'select * from htmlstring where url="' + $(this).val() + '" and xpath="//a"&format=json&env=store://datatables.org/alltableswithkeys&callback=';
        apiUrl = "https://query.yahooapis.com/v1/public/yql?q=select * from htmlstring where url='http://stackoverflow.com/'&format=json&diagnostics=true&env=store://datatables.org/alltableswithkeys&callback=";
        $('p.extract').toggle();
        $.get(apiUrl, function (data) {
            $('p.extract').addClass('none');
            var html = $(data).find('html');
            $("input.title").val(html.find("meta[property='og:title']").attr('content') || 'no title found');
            $("textarea.description").val(html.find("meta[property='og:description']").attr('content') || 'no description found');
            $("input.image").val(html.find("meta[property='og:image']").attr('content') || '');
        });
    });
});
input {
    width: 100%;
    margin-bottom: 20px;
    padding: 10px;
}

.none{display:none;}
<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
<button class="click">Click Me</button>
<br>
<p class="extract" style="display:none;">Extracting html</p>
<input type="text" class="title">
<br>
<textarea name="" id="" cols="30" rows="5" class="description"></textarea>
<br>
<input type="text" class="image">

Is there another way to extract the HTML meta tags from the head of other sites?

  • Share the query string that you have used. – CodeIt Jun 26 '17 at 13:07
  • Create a server side scraper – charlietfl Jun 26 '17 at 13:07
  • @CodeIt I just added the query into the question –  Jun 26 '17 at 13:13
  • Using this [query](https://query.yahooapis.com/v1/public/yql?q=select%20*%20from%20htmlstring%20where%20url%3D'http%3A%2F%2Fstackoverflow.com%2F'%20and%20xpath%3D'%2F%2Fa'&format=json&diagnostics=true&env=store%3A%2F%2Fdatatables.org%2Falltableswithkeys&callback=) I was able to get the complete HTML of Stack Overflow. If that works, I will post it as an answer. – CodeIt Jun 26 '17 at 13:15
  • Thanks @CodeIt, appreciate your help. But how then would it work if the API is down? Yeah as long as I can extract the `meta` data from `head` I would appreciate it :) –  Jun 26 '17 at 13:21
  • It is community powered. So don't worry. – CodeIt Jun 26 '17 at 13:26
  • Oh great. I would appreciate if you can add it as the answer :) So I can extract the `meta` data –  Jun 26 '17 at 13:27
  • @FriendofAfriend I have posted it as [answer](https://stackoverflow.com/a/44761039/3091398). – CodeIt Jun 26 '17 at 13:36
  • Possible duplicate of [YQL: html table is no longer supported](https://stackoverflow.com/questions/44431212/yql-html-table-is-no-longer-supported) – blakeo_x Jul 07 '17 at 18:57

2 Answers


Extracting HTML with YQL

http://developer.yahoo.com/yql/console/?q=select%20*%20from%20htmlstring%20where%20url%3D'YOUR_ENCODED_URL_HERE'&env=store%3A%2F%2Fdatatables.org%2Falltableswithkeys

Example

http://developer.yahoo.com/yql/console/?q=select%20*%20from%20htmlstring%20where%20url%3D'http%3A%2F%2Fstackoverflow.com%2F'&env=store%3A%2F%2Fdatatables.org%2Falltableswithkeys

REST Query

https://query.yahooapis.com/v1/public/yql?q=select%20*%20from%20htmlstring%20where%20url%3D'http%3A%2F%2Fstackoverflow.com%2F'&format=json&diagnostics=true&env=store%3A%2F%2Fdatatables.org%2Falltableswithkeys&callback=

Source

`htmlstring` is part of the community Open Data Tables.
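
As a rough sketch of how the REST query above could be assembled in code, the helper below encodes the statement and the `env` value with `encodeURIComponent`. The name `buildYqlUrl` and the single-quote check are illustrative assumptions, not part of YQL; the endpoint, table name, and `env` value are the ones shown in the REST query.

```javascript
// Hypothetical helper (not part of YQL): builds the REST URL for the
// community htmlstring table from the address of the page to fetch.
function buildYqlUrl(targetUrl) {
    // The statement wraps the address in single quotes, so a quote
    // inside the address would break the statement.
    if (targetUrl.indexOf("'") !== -1) {
        throw new Error("URL must not contain a single quote");
    }
    var query = "select * from htmlstring where url='" + targetUrl + "'";
    var env = "store://datatables.org/alltableswithkeys";
    // Encode each parameter value separately so '&' and '=' keep their meaning.
    return "https://query.yahooapis.com/v1/public/yql" +
        "?q=" + encodeURIComponent(query) +
        "&format=json" +
        "&env=" + encodeURIComponent(env);
}
```

Called with `http://stackoverflow.com/`, this produces the same `q` and `env` values as the REST query above, without the `diagnostics` and `callback` parameters.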

CodeIt
  • Is what I am doing correct: `query = 'select * from html where url="' + $(this).val() + '" and xpath="*"';` `apiUrl = 'http://developer.yahoo.com/yql/console/?q=' + encodeURIComponent(query);`? –  Jun 26 '17 at 13:42
  • It's not from `html`; you need to select from `htmlstring`. See [here](https://developer.yahoo.com/yql/console/?q=show%20tables&env=store://datatables.org/alltableswithkeys#h=select+*+from+htmlstring+where+url%3D'http%3A%2F%2Fwww.yahoo.com%2F'+and+xpath%3D'%2F%2Fa'). – CodeIt Jun 26 '17 at 13:44
  • Thanks man, I appreciate your help. [I'm doing this](https://jsfiddle.net/rubioli/qth5fdtq/), but I get an error: **GET https://query.yahooapis.com/v1/public/yql?q=select%20*%20from%20htmlstring%…e%20url%3D%22https%3A%2F%2Fstackoverflow.com%2F%22%20and%20xpath%3D%22*%22 400 (Bad Request)**. I don't understand what is wrong. Appreciate your help :) –  Jun 26 '17 at 14:00
  • My query becomes `https://query.yahooapis.com/v1/public/yql?q=select * from htmlstring where url="https://stackoverflow.com/" and xpath="*"` and I get an error. –  Jun 26 '17 at 14:09
  • I just ran the whole query directly and still can't grab the `meta` from `head`. And I get the **no found** :/ –  Jun 26 '17 at 14:21
  • @FriendofAfriend I have edited my answer. Removed the xpath from the query, it was causing the problem. – CodeIt Jun 26 '17 at 14:50
  • Thanks for taking the time. It is still not extracting the `meta` data from `head`. I created a [jsfiddle here](https://jsfiddle.net/rubioli/qth5fdtq/6/). As you can see, it doesn't extract them. Can you see where I'm going wrong? –  Jun 26 '17 at 20:34
  • @FriendofAfriend The returned data is a JSON object. This is your hint. As the question is titled `Is there other options for Yahoo's YQL for extracting HTML from other websites`, this answers the question. – CodeIt Jun 27 '17 at 02:59
  • You can get the XML version of the same [here](https://query.yahooapis.com/v1/public/yql?q=select%20*%20from%20htmlstring%20where%20url%3D'http%3A%2F%2Fstackoverflow.com%2F'&diagnostics=true&env=store%3A%2F%2Fdatatables.org%2Falltableswithkeys). From that, extract the contents of the `result` tag to get the complete HTML. – CodeIt Jun 27 '17 at 04:20
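
To make the JSON hint concrete: with `format=json` the `$.get` callback will typically receive a plain object rather than markup, so `$(data).find('html')` has nothing to search. Below is a minimal sketch, assuming the HTML arrives as a string in `query.results.result` (the JSON counterpart of the `result` tag mentioned above; this shape is an assumption and worth checking against a real response). Both function names are hypothetical.

```javascript
// Assumption: the htmlstring table answers with
// { query: { results: { result: "<html>...</html>" } } }.
function htmlFromYqlResponse(data) {
    var results = data && data.query && data.query.results;
    if (!results || typeof results.result !== "string") {
        // Covers "results": null, as in the error response in the question.
        return null;
    }
    return results.result;
}

// Browser only: DOMParser builds a separate document from the string,
// and scripts in the fetched page are not executed.
function parseHtml(html) {
    return new DOMParser().parseFromString(html, "text/html");
}
```

In the `$.get` callback, check the result of `htmlFromYqlResponse(data)` for `null` before passing it to `parseHtml`; the returned document can then be queried with `querySelector`.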

You might be able to read the meta tags with `querySelector`. I use `fetch` to grab Google Docs pages, which helpfully expose all the document properties as HTML meta tags. I then put the HTML into a temporary document that I can query with `querySelector` as I see fit. Something like:

var url = "https://docs.google.com/presentation/d/1blSsU5LHnrjSjb7voHXkRA_NlWo3yNjLiyttmoWfslM/edit#slide=id.gcb9a0b074_1_0";
// Path segments after the host: presentation / d / <id> / edit
var id = url.split("://")[1].split("/")[3];
var source = "https://docs.google.com/presentation/d/" + id + "/edit?usp=sharing";
fetch(source).then(function(response) {
        return response.text();
    }).then(function(html) {
        // Load the markup into a temporary document so it can be queried
        var doc = document.implementation.createHTMLDocument("foo");
        doc.documentElement.innerHTML = html;
        var meta = doc.querySelector("meta[property='og:description']");
        // querySelector returns null when the tag is missing
        return meta ? meta.getAttribute("content") : null;
    }).then(function(description) {
       console.log("document description", description);
    });
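
The same lookup can be generalised to the three Open Graph fields from the question. This is only a sketch: `readOpenGraph` is a hypothetical helper, it accepts any object with a `querySelector` method (such as the `doc` built above), and the fallback strings are assumptions modelled on the question's code.

```javascript
// Hypothetical helper: reads og:title, og:description and og:image from
// any object that exposes querySelector (a Document in the browser).
function readOpenGraph(doc) {
    function content(property) {
        var el = doc.querySelector("meta[property='" + property + "']");
        // A missing tag gives null instead of throwing on getAttribute.
        return el ? el.getAttribute("content") : null;
    }
    return {
        title: content("og:title") || "no title found",
        description: content("og:description") || "no description found",
        image: content("og:image") || ""
    };
}
```

The returned object can be copied straight into the form fields, for example `$("input.title").val(readOpenGraph(doc).title)`.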
frumbert