
Right now I have a URL that renders a list of .zip files in the browser. I am trying to use Rails to download those files and then open them using Zip::File from the rubyzip gem. Currently I am fetching the page with the typhoeus gem:

response = Typhoeus.get("http://url_with_zip_files.com")

But response.response_body is just the HTML document as a string. I am new to programming, so a hint in the right direction using best practices would help a lot.

response.response_body => "<!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 3.2 Final//EN\">\n<html>\n <head>\n  <title>Index of /mainstream/posts</title>\n </head>\n <body>\n<h1>Index of /mainstream/posts</h1>\n<table><tr><th><a href=\"?C=N;O=D\">Name</a></th><th><a href=\"?C=M;O=A\">Last modified</a></th><th><a href=\"?C=S;O=A\">Size</a></th><th><a href=\"?C=D;O=A\">Description</a></th></tr><tr><th colspan=\"4\"><hr></th></tr>\n<tr><td><a href=\"/5Rh5AMTrc4Pv/mainstream/\">Parent Directory</a></td><td>&nbsp;</td><td align=\"right\">  - </td><td>&nbsp;</td></tr>\n<tr><td><a href=\"1476536091739.zip\">1476536091739.zip</a></td><td align=\"right\">15-Oct-2016 16:01  </td><td align=\"right\"> 10M</td><td>&nbsp;</td></tr>\n<tr><td><a href=\"1476536487496.zip\">1476536487496.zip</a></td><td align=\"right\">15-Oct-2016 16:04  </td><td align=\"right\"> 10M</td><td>&nbsp;</td></tr>"
rebbailey
  • You need to give us a better idea of the situation and what the expected result is. "which is populated with a list of .zip files in the browser" do you intend to download each link and unpack it? How do you want the files ordered/categorized? – max Oct 18 '16 at 14:04
  • I will edit with the output. But yes, I would like to download each file and then unpack. – rebbailey Oct 18 '16 at 14:06
  • Then edit the question text so that this question can be answered. http://stackoverflow.com/help/how-to-ask – max Oct 18 '16 at 14:08

2 Answers


To break this down you need to:

  1. Get the initial HTML index page with Typhoeus

      base_url = "http://url_with_zip_files.com/"
      response = Typhoeus.get(base_url)
    
  2. Then use Nokogiri to parse that HTML and extract all the links to the zip files (see: extract links (URLs), with nokogiri in ruby, from a href html tags?)

    doc = Nokogiri::HTML(response.body)
    links = doc.css('a').map { |link| link['href'] }
    
    # Keep only the .zip hrefs -- the index page also contains sort links
    # (e.g. "?C=N;O=D") and a link to the parent directory
    # ("/5Rh5AMTrc4Pv/mainstream/") that you need to drop.
    # base_url already ends in "/", so the hrefs can be appended directly.
    links = links.select { |link| link.end_with?('.zip') }
                 .map    { |link| base_url + link }
    
    # Should look like:
    # links = ["http://url_with_zip_files.com/1476536091739.zip", "http://url_with_zip_files.com/1476536487496.zip" ...]
    
    
  3. Once you have all the links, visit each one to download the zip file with ruby and unzip it (see: Ruby: Download zip file and extract)

     require 'zip'      # rubyzip gem
     require 'stringio'
    
     # Define the helper before calling it, so the method exists
     # by the time the loop below runs.
     def download_and_parse(zip_file_link)
       input = Typhoeus.get(zip_file_link).body
       Zip::InputStream.open(StringIO.new(input)) do |io|
         while entry = io.get_next_entry
           puts entry.name
           parse_zip_content io.read
         end
       end
     end
    
     links.each do |link|
       download_and_parse(link)
     end
    

If you want to use Typhoeus to stream the file contents from the URL to memory, see the Typhoeus documentation section titled "Streaming the response body". You can also use Typhoeus to download all of the files in parallel, which would improve your performance.

Stefan Lyew

I believe Nokogiri will be your best bet.

base_url = "http://url_with_zip_files.com/"
doc = Nokogiri::HTML(Typhoeus.get(base_url).body)
zip_array = []
doc.search('a').each do |link|
  # to_s guards against anchors with no href; anchor the regex
  # so only links ending in .zip match
  if link.attr("href").to_s =~ /\.zip\z/i
    zip_array << Typhoeus.get(base_url + link.attr("href"))
  end
end
Steven B
  • Ok, I have done the same and am getting these lines back now: `ETHON: performed EASY effective_url=http://url_with_zip_files.com/mainstream/posts/1476544093681.zip response_code=200 return_code=ok total_time=86.572657`. But it is taking a very long time to download each file. The files are 10M each, is this normal? It seems like it might take me an entire day to download at the rate it is going. – rebbailey Oct 18 '16 at 15:04
  • For files that large, you may want to look into ways to dump the download to disk instead of into an array. – Steven B Oct 18 '16 at 15:07
  • To speed it up, you could try to save each link and launch a new thread for each download. That would give you concurrent downloads at least. – Steven B Oct 18 '16 at 15:11