How to download pdf file in ruby without .pdf in the link

Question

I need to download a PDF from a website that does not provide a link ending in .pdf, using Ruby. When I click the link manually, it takes me to a new page, and the dialog box to save/open the file appears after some time.

Please help me in downloading the file.

The link

score 5 · Answer 1 · answered Sep 07 '13 at 04:45


You can do this:

require 'open-uri'

# URI.open is needed on Ruby 3.0+, where open-uri no longer patches Kernel#open.
File.open('my_file_name.pdf', 'wb') do |file|
  file.write URI.open('http://someurl.com/2013-1-2/somefile/download').read
end

I have been doing this for my projects and it works.
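Since the URL gives no hint about the file type, it can help to confirm that what came back really is a PDF before saving it. The following is only a minimal sketch: the helper names `pdf_data?` and `download_pdf` are made up for illustration, and the check relies on PDF files normally starting with the bytes `%PDF-`.

```ruby
require 'open-uri'

# Returns true when the data begins with the PDF signature "%PDF-".
# (Hypothetical helper, not part of open-uri.)
def pdf_data?(data)
  data.b.start_with?('%PDF-')
end

# Downloads url and writes it to dest in binary mode.
# Raises if the body does not look like a PDF (for example, an HTML error page).
def download_pdf(url, dest)
  data = URI.open(url, &:read)
  raise "#{url} did not return a PDF" unless pdf_data?(data)

  File.binwrite(dest, data)
end

# Example call, using the placeholder URL and file name from this answer:
# download_pdf('http://someurl.com/2013-1-2/somefile/download', 'my_file_name.pdf')
```

If the check fails, the server most likely returned an HTML page instead of the file, which is worth inspecting before retrying.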

answered Sep 07 '13 at 04:45

roxxypoxxy


On Ruby 3.0 and later, use `URI.open` rather than a bare `open` to fetch the URL. Before Ruby 3.0, `open-uri` overrode `Kernel#open`, which led to vulnerabilities when the URL was untrusted. Ruby 3.0 removed that override, so `URI.open` is required. – choicodes Mar 16 '23 at 18:25

Peter Klipfel · Accepted Answer · 2013-07-26T23:17:05.120

0

If you just need a simple Ruby script to do it, I'd just run wget, like this:

exec 'wget "http://path.to.the.file/and/some/params"'

At that point though, you might as well run wget.

The other way is to run a GET request on the page where you know the PDF is:

require 'net/http'
source = Net::HTTP.get("the.website.com", "/and/some/params") # host name only, no "http://"

There are a number of other HTTP clients that you could use, but as long as you make a GET request to the endpoint that serves the PDF, it should give you the raw data. Then you can just save it under a name ending in .pdf, and you'll have the PDF.
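To make that concrete, here is a rough sketch of the GET-and-save approach using only the standard library. The names `fetch_body` and `save_as_pdf` are made up for this example, and following redirects is an assumption on my part, since download links like this often redirect before serving the file.

```ruby
require 'net/http'
require 'uri'

# Fetches url with a GET request and follows up to `limit` redirects.
# Returns the response body as a string of raw bytes.
def fetch_body(url, limit = 5)
  raise ArgumentError, 'too many redirects' if limit <= 0

  response = Net::HTTP.get_response(URI(url))
  case response
  when Net::HTTPSuccess
    response.body
  when Net::HTTPRedirection
    # Location may be relative, so resolve it against the current URL.
    fetch_body(URI.join(url, response['location']).to_s, limit - 1)
  else
    raise "GET #{url} failed: #{response.code} #{response.message}"
  end
end

# Saves the body under a name of our choosing, in binary mode.
# The URL itself does not need to end in .pdf.
def save_as_pdf(url, dest)
  File.binwrite(dest, fetch_body(url))
end

# Example call with placeholder values:
# save_as_pdf('http://the.website.com/and/some/params', 'thefile.pdf')
```

Writing with `File.binwrite` matters on platforms that translate line endings, since a PDF is binary data.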

In your case, I ran the following commands to get the pdf

wget http://www.lawcommission.gov.np/en/documents/prevailing-laws/constitution/func-download/129/chk,d8c4644b0f086a04d8d363cb86fb1647/no_html,1/
mv index.html thefile.pdf

Then open the PDF. Note that these are Linux commands. If you want to get the file with a Ruby script, you could use something like what I previously mentioned.

Update:

There is an added complication that was not initially stated, which is that the URL of the PDF changes every time the PDF is updated. To make this work, you probably want to do some web scraping. I suggest Nokogiri. This way you can look at the page where the download link is and then perform a GET request on the desired URL. Furthermore, the server that hosts the PDF is misconfigured, and breaks Chrome within a few seconds of opening the page.

How to solve this problem: I went to the site and refreshed it, then broke the connection to the server (press the X where the refresh button would otherwise be). Next, right-click next to the download link and select Inspect Element, then browse the DOM to find something that definitively identifies the link (like an id). Thankfully, I found one: `<strong id="telecharger"> Download</strong>`. This means that you can use something like `page.css('strong#telecharger')[0].parent['href']`, which should give you a URL. Then you can perform a GET request as described above. I don't have time to write the script for you (too much work to do), but this should be enough to solve the problem.

edited Jul 26 '13 at 23:17

answered Jul 25 '13 at 00:14

Peter Klipfel


I don't think you understood the question. I clearly mentioned that I don't have a link ending with .pdf; otherwise it would not be a problem. – Sushil Jul 25 '13 at 05:32
I was under the impression that you had an address that served back a pdf from a rest endpoint. Except that the endpoint didn't have the `.pdf` extension. If this is the case, then all you have to do is ask the server for the stuff at that endpoint, and add the `.pdf` extension when it gets to you. Is there a redirect in there? – Peter Klipfel Jul 25 '13 at 14:44
I am a newbie in the ruby programming language. Can you please provide the intended solution? I have already provided the link above. – Sushil Jul 25 '13 at 17:02
Asking for the answer without an understanding of what you're doing will often get you flamed. If I have time later, I will try to post a Ruby script that will do it. However, it's important to be willing to google your questions and read documentation and blog posts. If you're new to Ruby, you probably want to start doing tutorials. I learned through the Hartl tutorial (for Rails) http://ruby.railstutorial.org/. But I've heard good things about http://rubymonk.com/ if you're just interested in Ruby. Also, these things take time. People spend hundreds of hours learning new tools. – Peter Klipfel Jul 25 '13 at 22:05
I have used Ruby many times before (though I have not taken a complete course), and I have never encountered anything like downloading from the internet (even if you read the tutorials, they never teach you to do so). I have searched and read many blogs, but every single one used a link ending in .pdf, and my situation is different. So, if you would not mind, could you please guide me on downloading from the link that I mentioned above? – Sushil Jul 26 '13 at 06:08
I think you might be overestimating the importance of the file extension. If you run a GET request on `http://www.lawcommission.gov.np/en/documents/prevailing-laws/constitution/func-download/129/chk,d8c4644b0f086a04d8d363cb86fb1647/no_html,1/` using `Net::HTTP`, the file that you get back *is* a PDF file; it's just named wrong. All you have to do is add the `.pdf` extension, and any PDF reader will open it. – Peter Klipfel Jul 26 '13 at 14:57
The part /chk,d8c4644b0f086a04d8d363cb86fb1647/no_html,1/ changes every time the PDF gets updated, so there is no way of knowing what it will be after a few days. Can you provide a more definite solution? – Sushil Jul 26 '13 at 18:47
