Node: Get image from PDF whose URL has weird query string

Question

As part of #1917Live, I've made a Twitter bot that tweets 100-year-old New York Times articles about Russia.

It uses the New York Times' Article Search API to get the articles and then uses twit to tweet them.

I also try to make the tweets more engaging, like an actual newspaper would try to do. So I parse the headlines to make them more readable, tag users that are part of #1917Live, and add a hashtag.

Now here's the part where I'm stuck. Each article comes with a URL to a pdf file showing how it looked when it was printed. Here's an example. I want to download that pdf, convert the first page into an image, and attach the image to the tweet. This is the simplified code I tried to use to get the PDF:

var http = require('http');
var fs = require('fs');

var url = "http://query.nytimes.com/mem/archive-free/pdf?res=9500E4DC153AE433A25756C1A9629C946696D6CF";

var file = fs.createWriteStream("file.pdf");
var request = http.get(url, function(response) {
  response.pipe(file);
});

But this does not work. If I were trying to download a normal pdf file, with a .pdf file extension, I suspect I wouldn't be having any problems. But this is different. Any help would be very much appreciated.

`No 'Access-Control-Allow-Origin' header is present on the requested resource.` — guest271314, Apr 16 '17 at 04:05

guest271314 · Answer 1 · 2017-04-16T05:15:07.180

0

You can use YQL to request the resource and get the result back as JSON. If the query succeeds, the "url" property of the "results" property of the "query" property of the JSON response holds the fetched resource as a data URI.

let url = "https://query.yahooapis.com/v1/public/yql?q=select%20*%20from%20data.uri%20where%20url%3D%22http%3A%2F%2Fquery.nytimes.com%2Fmem%2Farchive-free%2Fpdf%3Fres%3D9500E4DC153AE433A25756C1A9629C946696D6CF%22&format=json&callback=";

fetch(url).then(response => response.json())
.then(({query:{results:{url}}}) => console.log(url))
.catch(err => console.log(err));
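If this runs in Node rather than in a browser, the logged data URI can be turned back into the underlying document with `Buffer`. The following is only a sketch: `decodeDataUri` is a hypothetical helper, and `sampleHtml` is a made-up stand-in for what the YQL query returns, not the real response.

```javascript
// Decode a data URI such as "data:text/html;charset=UTF-8;base64,...." into a string.
// Hypothetical helper; handles base64 and percent-encoded payloads only.
function decodeDataUri(dataUri) {
  const comma = dataUri.indexOf(",");
  if (!dataUri.startsWith("data:") || comma === -1) {
    throw new Error("not a data URI");
  }
  const header = dataUri.slice(5, comma);   // e.g. "text/html;charset=UTF-8;base64"
  const payload = dataUri.slice(comma + 1); // everything after the first comma
  return /;base64$/i.test(header)
    ? Buffer.from(payload, "base64").toString("utf8")
    : decodeURIComponent(payload);
}

// Made-up sample standing in for the YQL result (assumption, not real data).
const sampleHtml = '<html><body><iframe src="http://example.com/a.pdf"></iframe></body></html>';
const sampleUri = "data:text/html;charset=UTF-8;base64," +
  Buffer.from(sampleHtml, "utf8").toString("base64");

const decoded = decodeDataUri(sampleUri);
console.log(decoded);
```

The decoded string is the HTML wrapper page, so it still has to be searched for the `<iframe>` that points at the actual file.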

Note that the resource returns an HTML document, not a .pdf document. To get the URL of the .pdf from that HTML document, set the HTML as the .innerHTML of a <template> element, then read the .src of the <iframe> inside it.

The URL at <iframe> also carries an "Expires" query string parameter; once it has expired, the server responds with 403 (Forbidden). How long the URL stays valid is not yet known.

let url = "https://query.yahooapis.com/v1/public/yql?q=select%20*%20from%20data.uri%20where%20url%3D%22http%3A%2F%2Fquery.nytimes.com%2Fmem%2Farchive-free%2Fpdf%3Fres%3D9500E4DC153AE433A25756C1A9629C946696D6CF%22&format=json&callback=";
let template = document.createElement("template");
fetch(url).then(response => response.json())
.then(({query:{results:{url}}}) => 
  fetch(url).then(res => res.text())
  .then(html => {
    template.innerHTML = html;     
    let iframe = document.createElement("iframe");
    let src = template.content.querySelector("iframe").src; 
    console.log(src);
    // Drop the query string (including Expires); a URL without "?" is kept unchanged.
    iframe.src = src.split("?")[0];
    document.body.appendChild(iframe);
  })
)
.catch(err => console.log(err));
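The same two steps (find the `<iframe>` src, then drop the query string) can be sketched for Node, where there is no `document` or `<template>`. This is a rough illustration only: `extractIframeSrc` and `stripQuery` are hypothetical helpers, the sample HTML and its URL are invented, the regex does not decode HTML entities such as `&amp;`, and whether the server still serves the file once the query string is removed is untested.

```javascript
// Return the src of the first <iframe> in an HTML string, or null if none is found.
// A regex is a shortcut here; a real HTML parser is more robust.
function extractIframeSrc(html) {
  const match = /<iframe\b[^>]*\bsrc\s*=\s*["']([^"']+)["']/i.exec(html);
  return match ? match[1] : null;
}

// Remove the query string and fragment from an absolute URL.
function stripQuery(rawUrl) {
  const parsed = new URL(rawUrl);
  parsed.search = "";
  parsed.hash = "";
  return parsed.toString();
}

// Invented sample markup (assumption): the real page may look different.
const sampleHtml =
  '<html><body><iframe width="100%" ' +
  'src="http://example.com/archive/article.pdf?Expires=1492320000"></iframe></body></html>';

const src = extractIframeSrc(sampleHtml);
console.log(src);             // full src, including the Expires parameter
console.log(stripQuery(src)); // http://example.com/archive/article.pdf
```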

edited Apr 16 '17 at 05:15

answered Apr 16 '17 at 04:17


Note, the response from resource at `url` is `"text/html"` containing an `<iframe>` element which appears to render `.pdf` `document`. – guest271314 Apr 16 '17 at 04:37
It's logging `data:text/html;charset=UTF-8;base64` and then some really long string. What do I do with that? Also, if you try that URL here (https://developer.yahoo.com/yql/), it returns an error. – Harry Stevens Apr 16 '17 at 04:50
The response from the resource is a `data URI` whose content is `"text/html"`, not `.pdf`. Within the `html` there is an `<iframe>` element whose `src` is set to the `.pdf`. The `html` needs to be parsed for the `<iframe>` `src`. Note that the URL has an `"&Expires"` query string, though it could probably be removed. – guest271314 Apr 16 '17 at 04:53
Did not get error here when tried at YQL. What was the query you tried? – guest271314 Apr 16 '17 at 05:00
https://query.yahooapis.com/v1/public/yql?q=select%20*%20from%20data.uri%20where%20url%3D%22http%3A%2F%2Fquery.nytimes.com%2Fmem%2Farchive-free%2Fpdf%3Fres%3D9500E4DC153AE433A25756C1A9629C946696D6CF%22&format=json&callback= – Harry Stevens Apr 16 '17 at 05:05
There is an additional step necessary to get the URL of the `.pdf` from `<iframe>` within `html` string – guest271314 Apr 16 '17 at 05:06

score 0 · Answer 2 · answered Apr 22 '17 at 05:29

It turns out there was a much easier, and more obvious, way. I just used request and cheerio, as I should have done from the beginning.

var request = require("request"),
  cheerio = require("cheerio");

var url = "http://query.nytimes.com/mem/archive-free/pdf?res=9500E4DC153AE433A25756C1A9629C946696D6CF";

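request(url, function(error, response, body){
  if (error || response.statusCode !== 200){
    // Report the failure instead of silently doing nothing.
    console.error(error || "Unexpected status code: " + response.statusCode);
    return;
  }

  // The page is an HTML wrapper; the PDF itself is the src of its iframe.
  var $ = cheerio.load(body);
  var pdf = $("iframe").attr("src");
  console.log(pdf);
});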
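Once `pdf` holds the iframe's src, the file still has to be saved. Here is a minimal sketch using only Node's built-in modules. `looksLikePdf` and `downloadPdf` are hypothetical helpers, not part of request or cheerio. The check relies on the fact that PDF files begin with the bytes `%PDF-`, which helps catch the case where an HTML wrapper page gets saved under a .pdf name. Redirects are not followed, and it assumes the src is an absolute URL.

```javascript
const fs = require("fs");
const http = require("http");
const https = require("https");

// PDF files start with "%PDF-"; an HTML wrapper page does not.
function looksLikePdf(buffer) {
  return buffer.subarray(0, 5).toString("latin1") === "%PDF-";
}

// Fetch `url` into memory and write it to `dest` only if it is really a PDF.
// Hypothetical helper: no redirect handling, no timeout.
function downloadPdf(url, dest, callback) {
  const client = url.startsWith("https:") ? https : http;
  client.get(url, function (response) {
    if (response.statusCode !== 200) {
      response.resume(); // discard the body so the socket is released
      return callback(new Error("Unexpected status code: " + response.statusCode));
    }
    const chunks = [];
    response.on("data", function (chunk) { chunks.push(chunk); });
    response.on("end", function () {
      const body = Buffer.concat(chunks);
      if (!looksLikePdf(body)) {
        return callback(new Error("Response is not a PDF"));
      }
      fs.writeFile(dest, body, callback);
    });
  }).on("error", callback);
}

// Example usage, with `pdf` being the iframe src found earlier:
// downloadPdf(pdf, "file.pdf", function (err) {
//   if (err) { console.error(err.message); } else { console.log("Saved file.pdf"); }
// });
```

Converting the first page of the saved file into an image for the tweet is a separate step that needs a PDF rendering tool and is not covered here.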
