9

I am currently using OpenURI to download a file in Ruby. Unfortunately, it seems impossible to get the HTTP headers without downloading the full file:

require 'open-uri'

pbar = nil  # declared here so both lambdas close over the same variable
open(base_url,
  :content_length_proc => lambda {|t|
    if t && 0 < t
      pbar = ProgressBar.create(:total => t)  # ProgressBar from the ruby-progressbar gem
    end
  },
  :progress_proc => lambda {|s|
    pbar.progress = s if pbar
  }) {|io|
    puts io.size
    puts io.meta['content-disposition']
  }

Running the code above shows that it first downloads the full file and only then prints the header I need.

Is there a way to get the headers before the full file is downloaded, so I can cancel the download if the headers are not what I expect them to be?

  • duplicate? http://stackoverflow.com/questions/13916046/display-http-headers-using-openuri?rq=1 – yeyo Jul 04 '13 at 16:12
  • @Kira no, the linked answer still downloads the full file first, which is exactly what I did _not_ want. – ePirat Jul 16 '13 at 05:57
  • `open` only keeps the whole response in memory for responses of 10240 bytes or less; larger responses **are going to be streamed** to a `Tempfile`. You can use that knowledge to access the tempfile and do lean stuff with it. Nothing happens in memory unless you want it to. See my answer here: https://stackoverflow.com/questions/2263540/how-do-i-download-a-binary-file-over-http/33746205 But if you only want to access headers, you should not use `open`, because it will always read the response. The answers below are good. – Overbryd Aug 10 '18 at 17:26
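As a sketch of the behaviour described in the comment above (the helper name and the `example.org` URL are made up for illustration; 10240 bytes is open-uri's documented buffering threshold):

```ruby
require 'open-uri'

# open-uri buffers the whole response before yielding it: bodies of
# 10240 bytes or less arrive as a StringIO, larger ones are spooled
# to a Tempfile on disk.
def response_buffer_class(url)
  URI.open(url) { |io| return io.class }
end

# response_buffer_class('http://example.org')  # StringIO for small pages
```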

3 Answers

11

You can use Net::HTTP for this matter, for example:

require 'net/http'

http = Net::HTTP.start('stackoverflow.com')

resp = http.head('/')
resp.each { |k, v| puts "#{k}: #{v}" }
http.finish

Another example, this time getting the headers of the wonderful book, Object-Oriented Programming with ANSI-C:

require 'net/http'

http = Net::HTTP.start('www.planetpdf.com')

resp = http.head('/codecuts/pdfs/ooc.pdf')
resp.each { |k, v| puts "#{k}: #{v}" }
http.finish
  • It's cleaner to use the block form of `start`. See the example in [the documentation](http://ruby-doc.org/stdlib-2.0/libdoc/net/http/rdoc/Net/HTTP.html#method-i-head). – the Tin Man Jul 03 '13 at 19:08
  • There are many good reasons to use the block form, including automatically closing the connection when the block ends. It's the programmer's prerogative to do whatever they want, but there should be sound reasons. Indentation getting too deep, or the block form not fitting, sounds like a need to refactor. – the Tin Man Jul 05 '13 at 00:44
    Thanks, but this wasn't really what I wanted to achieve (see my answer). It helped a lot anyway in finding what I was looking for, thanks. – ePirat Jul 15 '13 at 23:05
  • @ePirat well actually, if you don't want to download the file and just want to gather information about it, then a HEAD request is indeed what you want. From RFC 2616 sec. 9.4: _This method (HEAD) can be used for obtaining metainformation about the entity implied by the request without transferring the entity-body itself. This method is often used for testing hypertext links for validity, accessibility, and recent modification._ Please visit http://www.w3.org/Protocols/rfc2616/rfc2616-sec9.html – yeyo Jul 16 '13 at 15:16
  • @kira Nice explanation. I always have trouble understanding Ruby's `http` lib. Would you help me with this? – Arup Rakshit Feb 20 '14 at 22:03
  • @Kira Can you drop me an email, so that if I have a problem I can send you an email, please? :-) My email is on my "about me" profile. I just need help with the `http/https` lib. – Arup Rakshit Feb 21 '14 at 06:42
  • In the last example `open-uri` still downloads the whole file, as can be seen by using the `progress_proc` param. – Nakilon Feb 09 '15 at 12:16
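Following the comment above about the block form of `start`, here is a minimal sketch (the helper name `head_headers` is made up for this example): `start` closes the connection when the block returns, and returns the block's value, so no explicit `http.finish` is needed.

```ruby
require 'net/http'

# Block form: the connection is opened, yielded, and closed automatically.
# Returns the response headers as a Hash of downcased names to value arrays.
def head_headers(host, path, port = 80)
  Net::HTTP.start(host, port) do |http|
    http.head(path).to_hash
  end
end

# head_headers('www.planetpdf.com', '/codecuts/pdfs/ooc.pdf')
```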
5

It seems what I wanted is not possible to achieve using OpenURI, at least not, as I said, without loading the whole file first.

I was able to do what I wanted using Net::HTTP's `request_get`.

Here is an example:

http.request_get('/largefile.jpg') {|response|
  # headers are available here; note that content-length is a String
  if response['content-length'].to_i < max_length
    response.read_body do |str|   # read body now
      # save to file
    end
  end
}

Note that this only works when using a block, doing it like:

response = http.request_get('/largefile.jpg')

the body will already have been read.
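As a fuller sketch of this approach (the helper name, host, path, and size limit are all hypothetical), checking the headers first and streaming the body only when the reported size is acceptable:

```ruby
require 'net/http'

# Sketch: fetch path from host only if the reported size is within
# max_bytes; returns the body, or nil when the download was skipped.
# Note: a missing content-length header counts as 0 here.
def download_if_small(host, path, max_bytes, port = 80)
  Net::HTTP.start(host, port) do |http|
    http.request_get(path) do |response|
      # Headers are readable here; the body has not been transferred yet.
      return nil if response['content-length'].to_i > max_bytes

      body = +''
      response.read_body { |chunk| body << chunk }  # stream the body in chunks
      return body
    end
  end
end
```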

3

Rather than use Net::HTTP, which can be like digging a pool on the beach using a sand shovel, you can use one of the many HTTP clients for Ruby and clean up the code.

Here's a sample using HTTParty:

require 'httparty'

resp = HTTParty.head('http://example.org')
resp.headers
# => {"accept-ranges"=>["bytes"], "cache-control"=>["max-age=604800"], "content-type"=>["text/html"], "date"=>["Thu, 02 Mar 2017 18:52:42 GMT"], "etag"=>["\"359670651\""], "expires"=>["Thu, 09 Mar 2017 18:52:42 GMT"], "last-modified"=>["Fri, 09 Aug 2013 23:54:35 GMT"], "server"=>["ECS (oxr/83AB)"], "x-cache"=>["HIT"], "content-length"=>["1270"], "connection"=>["close"]}

At that point it's easy to check the size of the document:

resp.headers['content-length'] # => "1270"

Unfortunately, the HTTPd you're talking to might not know how big the content will be; in order to respond quickly, servers don't necessarily calculate the size of dynamically generated output, since doing so would take almost as long and be almost as CPU-intensive as actually sending it. So relying on the "content-length" value can be unreliable.
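As a minimal sketch of that caveat (the helper name is made up), it is safer to treat a missing or empty `content-length` as unknown rather than silently coercing it to zero:

```ruby
# Hypothetical helper: extract the size a server reports for a response.
# Returns the size in bytes, or nil when the server didn't report one.
def reported_size(headers)
  raw = headers['content-length']
  return nil if raw.nil? || raw.to_s.strip.empty?
  raw.to_s.to_i
end

reported_size({ 'content-length' => '1270' })  # => 1270
reported_size({})                              # => nil
```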

The issue with Net::HTTP is it won't automatically handle redirects, so you have to add additional code. Granted, that code is supplied in the documentation, but it keeps growing as you need to do more things, until you've ended up writing yet another HTTP client (YAHC). So avoid that and use an existing wheel.

  • If I understand the code correctly, this actually does a HEAD request, which is not what I wanted in this specific case. Even though that would probably be a good way to solve this in general, I had to use a GET request here. – ePirat Mar 03 '17 at 00:45
  • A GET will always try to retrieve the entire file. It is possible to get inside the processing and abort the connection, but that is not being a good network citizen. Consider what happens: You issue a GET and the server loads the file to start sending it. You abort, and you've just caused extra load on the server and the intervening network and your host. That's why HEAD was invented, to avoid doing that. – the Tin Man Mar 03 '17 at 00:55
  • As I said, I am aware of that, but in this specific case HEAD was not working, so GET was the only option. And I wanted to avoid downloading the full file just to throw it away, so I thought being able to abort as early as possible, rather than after downloading the whole file, would be a good thing. – ePirat Mar 03 '17 at 17:40