
Scenario:
We are required to enter data daily into a government database in a European country. We suddenly need to retrieve some of that data, but the only format they offer is PDFs generated from the data, and there are hundreds of them. We would like to avoid sitting in front of a web browser clicking link after link.

The links generated look like

<a href='javascript:viajeros("174814255")'>
  <img src="img/pdf.png">
</a>

I have almost no experience with JavaScript, so I don't know whether I can install a routine as a bookmark that loops through the DOM, finds all the links, and calls the function. Nor, if that's possible, do I know how to write it.

The ID numbers can't be predicted, so I can't write another page or curl/wget script to do it. (And if I could, it would still fail as mentioned below.)

The 'viajeros' function is simple:

function viajeros(id){
  var idm = document.forms[0].idioma.value;
  window.open("parteViajeros.do?lang="+idm+"&id_fichero=" + id);
}

but feeding that URI to curl or wget fails. Apparently the server checks either a cookie or the Referer header and returns an error.

Besides, with each link putting the PDF in a browser tab instead of in the downloads directory, we would still have to do two clicks (tab and save) hundreds of times.

What should I do instead?

For what it's worth, this is on macOS 10.13.4. I normally use Safari, but I also have Opera and Firefox available. I could install Chrome, but that's the last resort. No, that's second to last: we also have a (shudder) Windows 10 laptop. THAT'S last resort.

(Note: I looked at the four suggested duplicates that seemed promising, but each either had no answer or instructed the asker to modify the code that generates the PDF.)

WGroleau
  • I am confused. What do you mean by "We suddenly have a need to retrieve some of that data"? And does "the only format they will allow is by PDF" mean you need to upload the PDFs created? What does "PDFs generated from the data" mean? What format is the data in? You can use JavaScript to select elements by walking down the tree, so elements without predictable IDs are fine as long as the structure is consistent. – JBis May 24 '18 at 16:07

2 Answers

1

I had a similar situation, where I had to download all the (invoice) PDFs that were generated in a day or over the past week.

After some research I was able to do the scraping with PhantomJS, and later I discovered CasperJS, which made the job easier.

PhantomJS is a headless browser, and CasperJS is a scripting utility that runs on top of it.

Since you have little experience with JS, CefSharp may help if you are more comfortable with C#.

Some useful links to get started with PhantomJS, CasperJS and CefSharp:

PhantomJS

CasperJS

CefSharp

Try reading the documentation for downloading files.

Akash Preet
  • download() looks easy enough, but how much documentation would I have to read to generate the correct parameters for 318 calls? I've already spent enough time on the other answer to have done all the clicks I wanted to avoid. :-( – WGroleau May 27 '18 at 19:20
1
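// Find every PDF icon; its parent <a> holds the javascript:viajeros("ID") link.
document.querySelectorAll("img[src=\"img/pdf.png\"]")
    .forEach((el, i) => {
      // The ID is the text between the double quotes in the href.
      let id = el.parentElement.href.split("\"")[1];
      // Same relative URL that viajeros() opens.
      let url =
          "parteViajeros.do?lang=" + document.forms[0].idioma.value +
          "&id_fichero=" + id;
      // Stagger the downloads, one every 1.5 seconds.
      setTimeout(() => {
        downloadURI(url, id);
      }, 1500 * i);
    });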

This gets all of the images of the PDF icon, then looks at their parent for the link target. This href has its ID extracted, and passed to a string construction making the path to the file to be downloaded, similar to ‘viajeros’ but without the window.open. This URL is then passed to downloadURI which performs the download.
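
The extraction and URL-building steps can also be tried on their own, without touching the page. The sketch below is only an illustration: `extractId` and `buildPdfUrl` are hypothetical helper names, not part of the site's code, and the regular expression assumes the href looks like the one in the question, with the quotes not percent-encoded.

```javascript
// Hypothetical helper: pull the numeric ID out of an href such as
// javascript:viajeros("174814255"). Returns null when nothing matches.
function extractId(href) {
  const match = /viajeros\(\s*["']?(\d+)["']?\s*\)/.exec(href);
  return match ? match[1] : null;
}

// Hypothetical helper: build the same relative URL that viajeros() opens.
// `lang` may be an empty string, as in the URL quoted in the comments.
function buildPdfUrl(id, lang) {
  return "parteViajeros.do?lang=" + encodeURIComponent(lang) +
         "&id_fichero=" + encodeURIComponent(id);
}

// Example with the link from the question:
// extractId('javascript:viajeros("174814255")')  → "174814255"
// buildPdfUrl("174814255", "")  → "parteViajeros.do?lang=&id_fichero=174814255"
```

Swapping `downloadURI` for `console.log` in the loop is a quick way to confirm the IDs and URLs look right before downloading anything.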

This uses the downloadURI function from another Stack Overflow answer. A URL can be downloaded by creating a link, setting its download attribute, and clicking it, which is what the function below does. This is only tested in Chrome.

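function downloadURI(uri, name) {
  // Create a temporary link whose download attribute names the saved file.
  var link = document.createElement("a");
  link.download = name;
  link.href = uri;
  // The link is attached to the page, clicked, then removed again.
  document.body.appendChild(link);
  link.click();
  document.body.removeChild(link);
}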

Open the page with the links and open the console. Paste the downloadURI function first, then the code above to download all the links.
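
If calls made in quick succession yield only one file, the spacing between them may matter. As a rough sketch, the scheduling can be kept in a small helper of its own so the delay is easy to adjust. `staggerCalls` and its `timer` parameter are hypothetical names introduced here for illustration; the delay value is a guess to tune, not a known requirement of any browser.

```javascript
// Hypothetical helper: call action(item) once per item, delayMs apart.
// The first call is scheduled at 0 ms, the second at delayMs, and so on.
// `timer` defaults to setTimeout; it is a parameter only so the scheduling
// can be checked without waiting.
function staggerCalls(items, action, delayMs, timer = setTimeout) {
  items.forEach((item, i) => {
    timer(() => action(item), delayMs * i);
  });
}

// Possible use in the browser console, after downloadURI has been pasted in
// (assumes the same page structure as the loop above):
//
//   const ids = Array.from(
//       document.querySelectorAll("img[src=\"img/pdf.png\"]"),
//       el => el.parentElement.href.split("\"")[1]);
//   staggerCalls(ids,
//       id => downloadURI("parteViajeros.do?lang=&id_fichero=" + id, id),
//       1500);
```

Collecting the IDs first also makes it easy to log them and check the count before starting the downloads.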

grg
  • This looks good. Don't really need the 'lang' part as the website and the PDFs are always in Spanish. I will try pasting both parts into a bookmark and give it a try. – WGroleau May 24 '18 at 20:56
  • There was already a bookmark containing javascript (for a different purpose) that was working. I pasted this code (function def first) in place of that, but when I clicked it, nothing happened. When I pasted it into the javascript console, also nothing happened. I couldn't figure out how to do the equivalent thing in Firefox. – WGroleau May 24 '18 at 22:27
  • Apparently pasting in all your code DID define the function, because when I subsequently pasted
    downloadURI("https://hospederias.guardiacivil.es/hospederias/parteViajeros.do?lang=&id_fichero=174814394","174814394");
    it did download the file. But pasting in two such calls in one operation only got one file. But pasting in again the document.….forEach loop made it say "undefined" but it wouldn't say WHAT was undefined.
    – WGroleau May 25 '18 at 10:01
  • Then I tried putting that loop inside a function get_em(){that code}, pasting that in so that the function would be defined. But then I pasted in "get_em();" and got nothing except the word "undefined." I've examined the loop, and see nothing wrong with it, but again, I am inexperienced in JS. – WGroleau May 25 '18 at 10:03
  • Then I tried redefining their function to call downloadURI but clicking on an icon still called their original function. – WGroleau May 25 '18 at 10:09
  • Finally, defined get_em() to just call downloadURI twice with two full URIs already hand-created. Called that, and it DID download, but only one file. Same as when pasting in two calls to downloadURI, it only downloads the second one. – WGroleau May 25 '18 at 10:32
  • SO, if I call downloadURI twice in a function with a string literal URI, it downloads the second one. So I know that a function can call downloadURI but I don't know how to make it do more than one. However, if your loop is in a function, it does not download anything. So, even though it looks OK to me, there must be a problem with that code. But the log shows no errors. I'm still baffled. – WGroleau May 25 '18 at 11:21
  • In Safari, if I replace 'downloadURI' with 'console.log' with no change before that, then it prints the correct URI and name, but your original statement does nothing. If I call downloadURI with one set of correct parameters, it works, and if I make downloadURI contain only a console.log, that works. So for some unknown reason, your loop can call console.log but it can't call downloadURI (in Safari). I am going to install Chrome and see what happens. I could not get Firefox to do anything at all. Nor GreaseMonkey. – WGroleau May 27 '18 at 19:14
  • @WGroleau Sorry for not getting back to you sooner. I have a similar page behind an intranet which I was using for testing and I still can't reproduce the problem you're having. For reference, this is what [each PDF icon on my page](https://i.stack.imgur.com/LgUTg.png) is like and this is [the function that's called](https://i.stack.imgur.com/NW9Nm.png). It may be some slight difference I haven't picked up on in changing the code which works for my page to your page. Perhaps some delay is required between each call to downloadURI, maybe calling it too quickly causes it to not work? – grg May 27 '18 at 19:34
  • If I write a function that calls downloadURI once, and then call that function, it works. If I put two calls in the second function, only the last one downloads. This may be a timing issue, but a pause syntax that has supposedly been working in Javascript for a year, Safari said syntax error. BUT, the main problem is not timing. If I change downloadURI in your loop to console.log, it prints the entire list of correct URIs. But the original does not call downloadURI even once. If I redefine downloadURI to only call console.log, that proves that your loop doesn't call it. Bug in Safari? – WGroleau May 27 '18 at 19:47
  • @WGroleau I've been doing this all in Safari and it's been working for me. I'm not sure what you mean by ‘the original does not call downloadURI even once’, what are you referring to by ‘the original’? If you're changing my loop to console.log and that works, what doesn't work? – grg May 27 '18 at 20:21
  • When I _don't_ change it to console.log, nothing happens. And if I change downloadURI to do console.log, still nothing happens. But if I define function test(URI, name){downloadURI(URI,name);} and then type in test("one","two"); I see onetwo in the log. So if Safari is doing it for you and not for me, it must be some setting in the preferences. I put a new question in Apple.SE for that: https://apple.stackexchange.com/questions/326430/why-can-a-command-in-safaris-javascript-js-console-not-call-a-previously-defi – WGroleau May 27 '18 at 20:35
  • Totally baffled. Tried it again this morning without changing any Safari settings. Your loop called downloadURI many times, and it downloaded two of 92 files. And I had in the sleep that got a syntax error before, but it gave no error. Unfortunately it also did not pause. I upped the delay to 1500 ms, and tried again. This time it only downloaded one, and both times, there were no delays. Totally weird. Now if it continues to work, perhaps getting the delay to work will solve the "only one" issue. – WGroleau May 28 '18 at 05:10
  • @WGroleau I've added a delay to my loop to download one file every 1.5 seconds. See how this goes for you now? – grg May 29 '18 at 19:10
  • That's similar to what I finally did, after trying a heck of a lot of other things. I declared i as 1, then delayed each download by 2000 * i++ – WGroleau May 29 '18 at 23:00