I am making a scraper with Node.js (using Request.js and Cheerio.js) and am navigating to download links to download PDFs and save them to a folder on my computer. The links start the download automatically; they don't just navigate to the rendered PDF, so I am not sure how to pipe the download to the folder from Node.
I'm not sure what you mean by rendered PDF, but if you know the URL to the document, just send an HTTP request to get the raw data and dump it to an output file.
See here for writing files with Node.js: Writing files in Node.js
Example from: http://www.sitepoint.com/making-http-requests-in-node-js/
// Stream the HTTP response directly into a file on disk.
var request = require("request");
var fs = require("fs");
request("http://www.sitepoint.com").pipe(fs.createWriteStream("jspro.htm"));
Thanks for the response. I have almost that exact snippet in my code, but when I navigate to the folder I am sending the PDFs to and try to open them, they won't open. They appear to be empty (they're only 10 bytes). I suspect this has to do with the fact that the link isn't to the actual PDF; it just starts the automatic PDF download. – user3821746 Jul 09 '14 at 18:40
Try opening the file using TextEdit/vi. My hunch is it fed you a redirect link. If that is the case, you just need to keep following the link trail until you hit the actual PDF. Remember, the PDF must exist at a URL for your browser to download it, unless the site has a crazy bizarre CMS setup. – droghio Jul 09 '14 at 18:43
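A minimal sketch of following the link trail by hand with request, assuming the download link answers with a 3xx redirect (followRedirect is a standard request option; the URL below is a placeholder for the actual download link):

var request = require("request");

// Turn off automatic redirect following so the Location header is visible.
request({
  url: "https://example.com/download-link", // placeholder
  followRedirect: false
}, function (error, response, body) {
  if (error) return console.error(error);
  if (response.statusCode >= 300 && response.statusCode < 400) {
    // The server sent a redirect; this is the next stop on the link trail.
    console.log("Redirects to:", response.headers.location);
  } else {
    console.log("No redirect, status:", response.statusCode);
  }
});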
That was very helpful, thanks! The files only say "The page you requested was removed.", which is the body of the response from those links. I'm not quite sure where to go from here, because as far as I know I have no way to follow the "link trail". When I load the given links in the browser, nothing is rendered (so I can't view source); the download just begins. – user3821746 Jul 09 '14 at 18:50
This is the tricky part. How did you navigate to the links from the browser? Did you load the homepage and navigate to them, or did you copy/paste the URL provided by your scraper? That "page removed" error might be because the site uses cookies/headers to determine which file to serve you. Cheerio doesn't store/deal with those. If that's the case, then you might want to look into a headless browsing solution. I've had luck with PhantomJS, but keep in mind the scrape will be a little slower. – droghio Jul 09 '14 at 19:00
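For reference, a minimal PhantomJS sketch of the headless approach droghio describes, assuming you just want to load a page and pull out its links (run it with the phantomjs binary rather than node; the URL is a placeholder):

// save as scrape.js and run with: phantomjs scrape.js
var page = require("webpage").create();

page.open("https://example.com/some-page", function (status) { // placeholder URL
  if (status !== "success") {
    console.log("Failed to load page");
    phantom.exit(1);
    return;
  }
  // evaluate runs inside the page, so the fully rendered DOM is available.
  var links = page.evaluate(function () {
    return Array.prototype.map.call(
      document.querySelectorAll("a"),
      function (a) { return a.href; }
    );
  });
  console.log(links.join("\n"));
  phantom.exit();
});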
Here is my initial [link](https://phila.legistar.com/LegislationDetail.aspx?ID=1814375&GUID=C6B2288F-F596-42BA-B568-95EB8FA89816&Options=ID|Text|&Search=) that I'm scraping. On this page, there are two attachments: [1](https://phila.legistar.com/View.ashx?M=F&ID=3126164&GUID=A31244A6-A813-4B71-997F-763771FCD455) and [2](https://phila.legistar.com/View.ashx?M=F&ID=3133507&GUID=D6DC41D6-744E-4E5D-B3D5-3400CA4200A2). I get these two links from my scraper, so I then navigate to them and try to pipe the response, but since the automatic download isn't the response, nothing is piped. – user3821746 Jul 09 '14 at 19:07
OK, the problem doesn't seem to be the automatic download. The site checks the headers on your request, plus a few cookies. Even if the PDF isn't rendered, you can still look at the request in Firefox. (Right-click the link, select "Open in New Tab", then in the tab go to Tools > Web Developer > Toggle Tools, then finally select the URL and hit Enter.) Your best bet is probably to switch to a headless browser that'll populate the headers for you, but if you want to stick with Cheerio you'll need to figure out what the magic part of the request is and send it manually. – droghio Jul 09 '14 at 19:34
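A rough sketch of that manual approach: replay the attachment request with headers copied out of the developer-tools network panel. The headers option is part of the request library; the header values below are placeholders you'd swap for whatever the browser actually sent:

var request = require("request");
var fs = require("fs");

// Placeholder header values; copy the real ones from the network panel.
var options = {
  url: "https://phila.legistar.com/View.ashx?M=F&ID=3126164&GUID=A31244A6-A813-4B71-997F-763771FCD455",
  headers: {
    "User-Agent": "Mozilla/5.0",
    "Referer": "https://phila.legistar.com/LegislationDetail.aspx?ID=1814375&GUID=C6B2288F-F596-42BA-B568-95EB8FA89816"
  }
};

request(options).pipe(fs.createWriteStream("attachment.pdf"));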
Thanks a ton for the help. I just want to make sure I'm understanding correctly: when I go to one of the two attachment links with developer tools on, I can see the request (and response) headers. So I need to find out which part of the request headers actually gets the download to be sent, and once I figure that out, I just need to send the same request headers through Request.js? Once I get the link to the PDF I don't actually use Cheerio.js anymore; I just have request(link).pipe(...) as you have above. – user3821746 Jul 09 '14 at 20:01
No problem, and kind of. There is something in your headers or cookies that the server is validating before sending your PDF. Your browser takes care of that data automatically; Cheerio can't, so either you need to implement the cookie/header handling yourself, or use a library that does (like PhantomJS). I would highly recommend looking into the headless browser approach; many give you DOM access similar to Cheerio's, but again the choice is yours. Either way, you are using the headers just to get the data, and then you'll need to pipe it to the file you want. – droghio Jul 10 '14 at 15:18
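A minimal sketch of the cookie-handling route using request's built-in cookie jar, assuming the server sets its cookies when the detail page is loaded (request.jar() and the jar option are part of the request library):

var request = require("request");
var fs = require("fs");

// A cookie jar makes request remember cookies between calls, like a browser.
var jar = request.jar();

// Load the detail page first so the server can hand out its cookies...
request({
  url: "https://phila.legistar.com/LegislationDetail.aspx?ID=1814375&GUID=C6B2288F-F596-42BA-B568-95EB8FA89816",
  jar: jar
}, function (error, response, body) {
  if (error) return console.error(error);
  // ...then fetch the attachment with the same jar attached.
  request({
    url: "https://phila.legistar.com/View.ashx?M=F&ID=3126164&GUID=A31244A6-A813-4B71-997F-763771FCD455",
    jar: jar
  }).pipe(fs.createWriteStream("attachment.pdf"));
});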
I took your advice and just added all the headers I saw in the developer tools to the request(link).pipe() line, and I was able to pipe the PDFs to my folder. My only issue now is that I am starting from an RSS feed that contains 100 links to bills (each link is like the first link above) and when I go into each link to grab their PDF(s), I get a socket hangup, and only the first PDF is grabbed. This is a separate issue though, so thanks for helping me out with this one! – user3821746 Jul 10 '14 at 15:37
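On that last socket-hangup problem, one common cause is firing all 100 requests at once; a minimal sketch of downloading the attachments one at a time instead (the links array here is a placeholder for whatever the scraper collects from the RSS feed):

var request = require("request");
var fs = require("fs");

// Placeholder list; in the real scraper this comes from the RSS feed.
var links = [
  "https://phila.legistar.com/View.ashx?M=F&ID=3126164&GUID=A31244A6-A813-4B71-997F-763771FCD455",
  "https://phila.legistar.com/View.ashx?M=F&ID=3133507&GUID=D6DC41D6-744E-4E5D-B3D5-3400CA4200A2"
];

function downloadNext(index) {
  if (index >= links.length) return;
  request(links[index])
    .pipe(fs.createWriteStream("bill-" + index + ".pdf"))
    .on("finish", function () {
      // Start the next download only after this file is fully written,
      // so the server never sees dozens of open sockets at once.
      downloadNext(index + 1);
    });
}

downloadNext(0);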