
Right now I have a URL that renders a list of .zip files in the browser. I am trying to use Rails to download those files and then open them using Zip::File from the rubyzip gem. Currently I am fetching the page with the typhoeus gem:

response = Typhoeus.get("http://url_with_zip_files.com")

But response.response_body is just the HTML document as a string. I am new to programming, so a hint in the right direction using best practices would help a lot.

response.response_body => "<!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 3.2 Final//EN\">\n<html>\n <head>\n  <title>Index of /mainstream/posts</title>\n </head>\n <body>\n<h1>Index of /mainstream/posts</h1>\n<table><tr><th><a href=\"?C=N;O=D\">Name</a></th><th><a href=\"?C=M;O=A\">Last modified</a></th><th><a href=\"?C=S;O=A\">Size</a></th><th><a href=\"?C=D;O=A\">Description</a></th></tr><tr><th colspan=\"4\"><hr></th></tr>\n<tr><td><a href=\"/5Rh5AMTrc4Pv/mainstream/\">Parent Directory</a></td><td>&nbsp;</td><td align=\"right\">  - </td><td>&nbsp;</td></tr>\n<tr><td><a href=\"1476536091739.zip\">1476536091739.zip</a></td><td align=\"right\">15-Oct-2016 16:01  </td><td align=\"right\"> 10M</td><td>&nbsp;</td></tr>\n<tr><td><a href=\"1476536487496.zip\">1476536487496.zip</a></td><td align=\"right\">15-Oct-2016 16:04  </td><td align=\"right\"> 10M</td><td>&nbsp;</td></tr>"
rebbailey
  • You need to give us a better idea of the situation and what the expected result is. "which is populated with a list of .zip files in the browser" do you intend to download each link and unpack it? How do you want the files ordered/categorized? – max Oct 18 '16 at 14:04
  • I will edit with the output. But yes, I would like to download each file and then unpack. – rebbailey Oct 18 '16 at 14:06
  • Then edit the question text so that this question can be answered. http://stackoverflow.com/help/how-to-ask – max Oct 18 '16 at 14:08

2 Answers


To break this down you need to:

  1. Get the initial HTML index page with Typhoeus

      base_url = "http://url_with_zip_files.com/"
      response = Typhoeus.get(base_url)
    
  2. Then use Nokogiri to parse that HTML and extract all the links to the zip files (see: extract links (URLs), with nokogiri in ruby, from a href html tags?)

    doc = Nokogiri::HTML(response.body)
    links = doc.css('a').map { |link| link['href'] }
    
    # Keep only the .zip hrefs -- the index page also contains sort links
    # (e.g. "?C=N;O=D") and a link to the parent directory
    # ("/5Rh5AMTrc4Pv/mainstream/") that you need to drop.
    # base_url already ends in "/", so the hrefs can be appended directly.
    links = links.select { |link| link.end_with?('.zip') }
                 .map    { |link| base_url + link }
    
    # Should look like:
    # links = ["http://url_with_zip_files.com/1476536091739.zip", "http://url_with_zip_files.com/1476536487496.zip" ...]
    
    
  3. Once you have all the links, visit each one to download the zip file with ruby and unzip it (see: Ruby: Download zip file and extract)

     require 'zip'      # rubyzip gem
     require 'stringio'
    
     # Define the helper before calling it, so the method exists
     # by the time the loop below runs.
     def download_and_parse(zip_file_link)
       input = Typhoeus.get(zip_file_link).body
       Zip::InputStream.open(StringIO.new(input)) do |io|
         while entry = io.get_next_entry
           puts entry.name
           parse_zip_content io.read
         end
       end
     end
    
     links.each do |link|
       download_and_parse(link)
     end
    

If you want to use Typhoeus to stream the file contents from the URL to memory, see the Typhoeus documentation section titled "Streaming the response body". You can also use Typhoeus to download all of the files in parallel, which would improve your performance.

Stefan Lyew

I believe Nokogiri will be your best bet.

base_url = "http://url_with_zip_files.com/"
doc = Nokogiri::HTML(Typhoeus.get(base_url).body)
zip_array = []
doc.search('a').each do |link|
  # to_s guards against anchors with no href; anchor the regex
  # so only links ending in .zip match
  if link.attr("href").to_s =~ /\.zip\z/i
    zip_array << Typhoeus.get(base_url + link.attr("href"))
  end
end
Steven B
  • Ok, I have done the same and am getting these lines back now: `ETHON: performed EASY effective_url=http://url_with_zip_files.com/mainstream/posts/1476544093681.zip response_code=200 return_code=ok total_time=86.572657`. But it is taking a very long time to download each file. The files are 10M each, is this normal? It seems like it might take me an entire day to download at the rate it is going. – rebbailey Oct 18 '16 at 15:04
  • For files that large, you may want to look into ways to dump the download to disk instead of into an array. – Steven B Oct 18 '16 at 15:07
  • To speed it up, you could try to save each link and launch a new thread for each download. That would give you concurrent downloads at least. – Steven B Oct 18 '16 at 15:11