3

I am downloading part of an HTML page by:

require 'nokogiri'
require 'open-uri'

doc = Nokogiri::HTML(URI.open('https://example.com/index.html'))
wiki = doc.xpath('//*[@id="wiki"]/div[1]')

and I need the stylesheets in order to display it correctly. They are included in the header like so:

<!DOCTYPE html>
<html lang="en" class="">
  <head>
    ...
    <link href="https://example.com/9f40a.css" media="all" rel="stylesheet" />
    <link href="https://example.com/4e5fb.css" media="all" rel="stylesheet" />
    ...
  </head>
  ...

and their file names can change. How do I parse the page and download local copies of the stylesheets?

the Tin Man
Jasmine Lognnes

2 Answers

4

Something like this:

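require 'open-uri'

# Save every stylesheet linked from <head> into /tmp.
doc.css("head link").each do |tag|
  link = tag["href"]
  # Skip <link> tags that have no href or do not point at a CSS file.
  next unless link && link.end_with?("css")
  File.open("/tmp/#{File.basename(link)}", "w") do |f|
    # URI.open replaces the bare open(url) form, which Ruby 3.0 removed.
    content = URI.open(link) { |g| g.read }
    f.write(content)
  end
end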
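The loop above assumes every href is an absolute URL ending in "css". A minimal sketch of one way to relax that, using only the standard library: the helper name `stylesheet_target` and the `page_url` argument are hypothetical, not part of the answer. It resolves relative hrefs against the page address and ignores any query string when choosing the local file name.

```ruby
require 'uri'

# Hypothetical helper (not from the answer): resolve an href against the URL
# of the page it came from, and derive a local file name from the path
# component only, so "style.css?v=2" is still treated as a CSS file.
# Returns [absolute_url, local_path], or nil when the href is not a .css file.
def stylesheet_target(page_url, href, dir = "/tmp")
  absolute = URI.join(page_url, href)
  return nil unless absolute.path.end_with?(".css")

  [absolute.to_s, File.join(dir, File.basename(absolute.path))]
end
```

Inside the loop, `stylesheet_target(page_url, link)` could stand in for the `end_with?` check, with the first element passed to the download call and the second used as the output path.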
mrbrdo
  • If I change `/path` to `/tmp` then I get `/usr/share/ruby/open-uri.rb:353:in `open_http': 403 Forbidden (OpenURI::HTTPError)`. If I change the `open` command to `open(link["href"], "User-Agent" => "Mozilla/5.0 (Windows NT 6.0; rv:12.0) Gecko/20100101 Firefox/12.0 FirePHP/0.7.1")` then I get `/usr/share/ruby/open-uri.rb:36:in `initialize': no implicit conversion of Hash into String (TypeError)` – Jasmine Lognnes Apr 20 '15 at 10:56
  • If you need to set headers you will have to use Net::HTTP or Mechanize or something like that. I don't believe open-uri supports this. I don't think this is directly related to the question though. See this for Net::HTTP example with headers: http://stackoverflow.com/questions/587559/how-to-make-an-http-get-with-modified-headers – mrbrdo Apr 20 '15 at 10:58
  • I just tried with `wget` and it can download the `css` files without a user-agent string. I think perhaps some of the other links (not shown in the OP) in the `head` require a user-agent. Is there a way to have your script only download the css files? – Jasmine Lognnes Apr 20 '15 at 11:06
  • It's exactly what it does. It downloads the CSS files. Like I said, if you need to pass headers then use Net::HTTP instead of open-uri (open). – mrbrdo Apr 20 '15 at 11:27
  • I have modified your code to only download css files. Yours downloaded all links, not just css. – Jasmine Lognnes Apr 20 '15 at 11:36
  • Thanks @JasmineLognnes I took your change into account and updated the answer. You should note that in your suggested regexp you should have used `\z` instead of `$`. – mrbrdo Apr 20 '15 at 12:57
  • Instead of opening a file with a block to write, use `File.write("/tmp/#{File.basename(link)}", open(link).read)` – the Tin Man Apr 21 '15 at 20:07
  • @theTinMan I agree for writing, but reading like you propose I think is not a good idea as you never call `close` on the File object. It probably gets GC'ed at some point but it's much better to use the block so it is closed immediately after use. – mrbrdo Apr 28 '15 at 13:00
  • @mrbrdo it's not worth worrying about when reading if only one file is being read. It becomes an issue if multiple files are read causing the system's file handles to be used up. It's very idiomatic to not bother when processing HTML/XML. – the Tin Man Apr 28 '15 at 16:00
  • @mrbrdo, additional testing seems to show that using `read` with OpenURI automatically closes the incoming HTTP stream, which I'd expect. Using `lsof -i` to track the open connections shows nothing open after stepping over an `open(...).read`, so it appears to be even less of a concern. Also, `File.write` automatically closes the output stream, so nothing could be left hanging. – the Tin Man Apr 28 '15 at 16:51
  • `open(...).read` does not close the FD. Of course, if you are stepping over it, the GC will trigger before you have a chance to run `lsof`. Trust me, it is a concern: I have projects where I had issues because of this, and it is not easy to debug once you already have this problematic code. Like I said, I agree for writing. – mrbrdo Apr 28 '15 at 22:07
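On the header question raised in the comments: the open-uri documentation describes string-keyed options as additional request header fields, so a User-Agent can be sent without switching libraries. A minimal sketch, where the method name `fetch_with_user_agent` and the default agent string are placeholders rather than anything from the thread. The block form is used so the stream is closed as soon as the body has been read.

```ruby
require 'open-uri'

# Hypothetical helper: fetch a URL while sending a custom User-Agent header.
# open-uri treats string keys in the options hash as request header fields.
def fetch_with_user_agent(url, user_agent = "Mozilla/5.0")
  # The block form closes the stream once the body has been read.
  URI.open(url, "User-Agent" => user_agent) { |io| io.read }
end
```

One guess, not confirmed in the thread: the TypeError quoted above is what the older bare `open` would raise if the first argument was not an absolute URL, since the call then falls through to plain file opening with the hash in the mode position.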
1

I'm not a Ruby expert, but you can follow these steps:

  • You can use the .scan(...) method provided by the String type to parse the page and get the .css file names. The scan method returns an array of stylesheet file names. Find more info on scan here
  • Then download and store the files with Net::HTTP.get(...); an example is here
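The two steps above can be sketched roughly as follows. The method names `css_urls` and `download_css` and the regular expression are assumptions made for illustration; the pattern only handles simple `<link ... href="...css">` tags and is far less robust than a real HTML parser. Note also that `Net::HTTP.get` does not follow redirects.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: collect href values ending in ".css" from <link> tags.
# scan returns one array per match because of the capture group, so flatten.
def css_urls(html)
  html.scan(/<link\b[^>]*\bhref=["']([^"']+\.css)["']/i).flatten
end

# Hypothetical helper: fetch one absolute stylesheet URL and store it in dir.
# Returns the path of the file that was written.
def download_css(url, dir = "/tmp")
  uri = URI(url)
  path = File.join(dir, File.basename(uri.path))
  File.write(path, Net::HTTP.get(uri))
  path
end
```

With the page source in a string `html`, `css_urls(html).each { |url| download_css(url) }` would save each stylesheet under /tmp, assuming the hrefs are absolute URLs.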
Community
deimus
  • Most of the time, OpenURI is preferable to Net::HTTP because of its automatic redirect handling and ease of use. – the Tin Man Apr 28 '15 at 16:04