5

I only need to download the first few kilobytes of a file via HTTP.

I tried

require 'open-uri'

limit = 4096 # number of bytes wanted
url = 'http://example.com/big-file.dat'
file = URI.open(url) # Kernel#open no longer accepts URLs as of Ruby 3.0
content = file.read(limit)

But open downloads the entire file before read is ever called.

Andrew Grimm
taro

3 Answers

4

This seems to work when using sockets:

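require 'socket'

host = "download.thinkbroadband.com"
path = "/1GB.zip" # 1 GB sample file
# Some servers return an error page unless a Host header is sent
request = "GET #{path} HTTP/1.0\r\nHost: #{host}\r\n\r\n"
socket = TCPSocket.open(host, 80)
socket.print(request)

# read byte by byte until the blank line that ends the response headers
buffer = ""
until buffer.end_with?("\r\n\r\n")
  buffer += socket.read(1)
end

response = socket.read(100) # read the first 100 bytes of the body
puts response
socket.close # drop the connection instead of receiving the rest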

I'm curious if there is a "ruby way".

Michel de Graaf
  • Hi Michel, for some reason whenever I try a file such as `http://www.forcefieldpr.com/asdyoucantbealone.mp3`, which works in the browser, I keep getting a 404 html page. Would this be to do with the request? – Aaron Moodie Jun 24 '10 at 09:18
  • I submitted an edit that fixes the issue @AaronMoodie has. Some web servers need the "Host" header so I added just that: `request = "GET #{path} HTTP/1.1\r\nHost: #{host}\r\n\r\n"` – inket Aug 10 '13 at 08:54
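
The header-skipping step in the socket answer can be pulled into a small helper. This is only a minimal sketch: it assumes an object that responds to `gets` and `read` (a `TCPSocket`, or a `StringIO` for offline testing), and the name `read_body_prefix` is hypothetical, not part of any library. It reads the headers line by line rather than one byte at a time.

```ruby
require 'stringio'

# Hypothetical helper: skip the HTTP response headers on `io`, then
# return up to `limit` bytes of the body (nil if no body bytes remain).
def read_body_prefix(io, limit)
  # The headers end at the first blank line, i.e. a bare "\r\n".
  while (line = io.gets("\r\n"))
    break if line == "\r\n"
  end
  io.read(limit)
end

# Offline demonstration with a canned response instead of a live socket.
canned = StringIO.new("HTTP/1.0 200 OK\r\nContent-Type: text/plain\r\n\r\nhello world")
puts read_body_prefix(canned, 5) # prints "hello"
```

With a real connection, the same helper would be called on the socket after `socket.print(request)`, followed by `socket.close` so the rest of the file is not transferred.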
4

This is an old thread, but it's still a question that seems mostly unanswered according to my research. Here's a solution I came up with by monkey-patching Net::HTTP a bit:

require 'net/http'

# provide access to the actual socket
class Net::HTTPResponse
  attr_reader :socket
end

uri = URI("http://www.example.com/path/to/file")
begin
  Net::HTTP.start(uri.host, uri.port) do |http|
    request = Net::HTTP::Get.new(uri.request_uri)
    # calling request with a block prevents body from being read
    http.request(request) do |response|
      # do whatever limited reading you want to do with the socket
      x = response.socket.read(100)
    end
  end
rescue IOError
  # ignore
end

The rescue catches the IOError that's thrown when you call HTTP.finish prematurely.

FYI, the socket within the HTTPResponse object isn't a true IO object (it's an internal class called BufferedIO), but it's pretty easy to monkey-patch that, too, to mimic the IO methods you need. For example, another library I was using (exifr) needed the readchar method, which was easy to add:

class Net::BufferedIO
  def readchar
    read(1)[0].ord
  end
end
zed_0xff
Dustin Frazier
  • Great! You can access the socket without patching by the way, just use: `response.instance_variable_get(:@socket).read(5120)` – inket Apr 08 '13 at 01:02
  • This solution gets stuck indefinitely with ruby-2.0.0p247 under OS X 10.9. Couldn't narrow the problem down, but the backtrace mentions line 155 in `net/protocol.rb`. – inket Aug 10 '13 at 09:01
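
One alternative that avoids reaching into the internal socket is to stream the body with `read_body` and abandon the request once enough bytes have arrived. This is a different technique from the monkey-patch above, sketched with documented Net::HTTP calls only. The helper name `fetch_prefix` and the `:enough` tag are hypothetical. Note that a plain `break` inside the `read_body` block is likely not enough, because Net::HTTP reads the remaining body when the request block returns; unwinding past `Net::HTTP.start` with `throw` lets its cleanup close the connection instead.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: return at most `limit` bytes of the response body.
# Slightly more than `limit` bytes may cross the wire, since the body
# arrives in chunks and the OS buffers incoming data.
def fetch_prefix(url, limit)
  uri = URI(url)
  data = String.new # binary buffer for the received chunks
  catch(:enough) do
    Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
      http.request_get(uri.request_uri) do |response|
        # With a block, read_body yields the body piece by piece.
        response.read_body do |chunk|
          data << chunk
          # Unwind out of Net::HTTP.start, which closes the socket.
          throw :enough if data.bytesize >= limit
        end
      end
    end
  end
  data.byteslice(0, limit)
end
```

A call might look like `content = fetch_prefix('http://example.com/big-file.dat', 4096)`, reusing the URL from the question.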
0

Check out "OpenURI returns two different objects". You might be able to abuse the methods in there to interrupt downloading/throw away the rest of the result after a preset limit.

the Tin Man
Joel Meador