I'd like to develop a "page downloader" in Ruby: something that, given a URL, will download the HTML and the associated CSS, image files, and JavaScript files, and then change the HTML to reference the local copies instead of the remote ones. Much like what some browsers do with the "save as complete page" option.

I was thinking about using Nokogiri to do the initial parsing of the page. But I'm not sure it's the best tool for the job:

  • Can it get a list of external dependencies (stylesheets, images, and JavaScript files)? I don't care about JavaScript-generated dependencies.
  • Does it parse CSS? I might want to download images or @imported css files, too.

Is there a gem that already does what I want?

kikito
  • Related (but not identical) question: http://stackoverflow.com/questions/1080565/rails-emulate-save-page-as-behaviour – kikito Jun 06 '12 at 10:18
  • You could try a testing framework that controls a web browser, e.g. Selenium WebDriver + any regular browser, or HtmlUnit (a headless browser). It might be a bit heavy for what you asked about, though. – echristopherson Jun 06 '12 at 21:25

2 Answers

  1. No, Nokogiri does not know about external dependencies. You can do something like:

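    # doc is assumed to be an already-parsed Nokogiri HTML document
    js_urls  = doc.xpath('//script/@src').map(&:content)
    # restrict <link> to stylesheets so icons, feeds, etc. are not mistaken for CSS
    css_urls = doc.xpath('//link[@rel="stylesheet"]/@href').map(&:content)
    img_urls = doc.xpath('//img/@src').map(&:content)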
    

    …but that will not find:

    • scripts or CSS loaded dynamically by JavaScript, i.e. by creating elements and appending them to the document (which you say you don't care about)
    • images requested by JavaScript, e.g. var img = new Image; img.src="..."; (which you say you don't care about)
    • CSS linked from CSS, e.g. @import url(foo.css);
    • Images referenced by CSS, e.g. #nav { background:url(/images/navhead.png) }
       

    Further, any of the URLs you get back may be relative to the URL of the page, so you will need to resolve them against that URL before downloading anything.

  2. No, Nokogiri is an X/HTML DOM library (standing on top of libxml2). It does not parse JavaScript, it does not execute JavaScript, it does not parse CSS, and it cannot apply CSS to a page. It's not a web browser.
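
The relative-URL point in item 1 can be handled with `URI.join` from Ruby's standard library. What follows is only a minimal sketch under stated assumptions: the helper name `absolutize`, the variable `page_url`, and the example.com addresses are hypothetical and chosen purely for illustration.

```ruby
require 'uri'

# Hypothetical helper: resolve each extracted reference against the page URL.
# References that cannot be parsed as URIs are skipped.
def absolutize(page_url, refs)
  refs.map do |ref|
    begin
      URI.join(page_url, ref).to_s
    rescue URI::InvalidURIError
      nil # unparseable reference, drop it
    end
  end.compact
end

# Assumed example input; in practice refs would be js_urls, css_urls, or img_urls.
page_url = 'http://example.com/blog/post.html'
refs = ['/css/site.css', 'img/logo.png', '../js/app.js', 'http://cdn.example.org/lib.js']

puts absolutize(page_url, refs)
```

Already-absolute URLs pass through unchanged, so the same helper can be applied to every list extracted from the document.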

Phrogz

It seems what I wanted is not already implemented. Nokogiri could be used to parse the HTML, and there are other gems out there that can parse the CSS (e.g. css_parser), but I have not personally used them, and they will probably have issues with modern CSS (media queries, imports, etc.).
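
For the narrower job of just finding the files a stylesheet points at, a regular-expression scan may be enough as a first pass. This is a rough sketch, not a CSS parser: the constants `CSS_URL` and `CSS_IMPORT` and the helper `css_references` are hypothetical names, and the patterns assume reasonably well-formed CSS (they do not skip comments or handle escaped characters).

```ruby
# Matches url(foo.png), url('foo.png') and url( "foo.png" ).
CSS_URL    = /url\(\s*['"]?([^'")]+?)['"]?\s*\)/i
# Matches the string form of imports: @import "foo.css";
CSS_IMPORT = /@import\s+['"]([^'"]+)['"]/i

# Hypothetical helper: list the external references found in a CSS string.
# Inline data: URIs are dropped because there is nothing to download.
def css_references(css)
  (css.scan(CSS_URL) + css.scan(CSS_IMPORT))
    .flatten
    .map(&:strip)
    .reject { |ref| ref.start_with?('data:') }
    .uniq
end

# Assumed sample stylesheet, for illustration only.
css = <<~CSS
  @import url(foo.css);
  @import "bar.css";
  #nav { background:url(/images/navhead.png) }
  .logo { background: url( "img/logo.png" ) no-repeat; }
CSS

p css_references(css)
```

The references returned here are relative to the stylesheet's own URL, not to the page that linked it, so they would need to be resolved against the stylesheet's address before downloading.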

kikito