
I'm trying to get all of the domains / IP addresses that a particular page depends on, using Nokogiri. It can't be perfect because of JavaScript dynamically loading dependencies, but I'm happy with a best effort at getting:

  • Image URLs: <img src="...">
  • JavaScript URLs: <script src="...">
  • CSS files and any CSS url(...) references
  • Frames and iframes

I'd also want to follow any CSS imports.

Any suggestions / help would be appreciated. The project is already using Anemone.

Here's what I have at the moment.

require 'anemone'

Anemone.crawl(site, :depth_limit => 1) do |anemone|
  anemone.on_every_page do |page|
    # Image sources
    page.doc.xpath('//img[@src]').each do |node|
      process_dependency(page, node[:src])
    end
    # External scripts (the [@src] predicate skips inline <script> blocks)
    page.doc.xpath('//script[@src]').each do |node|
      process_dependency(page, node[:src])
    end
    # Stylesheets, favicons and anything else pulled in via <link>
    page.doc.xpath('//link[@href]').each do |node|
      process_dependency(page, node[:href])
    end
    puts page.url
  end
end
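
For the frames and iframes in the list above I'm assuming the same pattern will work, and I'm picturing process_dependency (not shown above) as something that resolves each reference against the page URL and records the host, roughly:

# Inside on_every_page, alongside the lookups above
page.doc.xpath('//frame[@src] | //iframe[@src]').each do |node|
  process_dependency(page, node[:src])
end

require 'uri'
require 'set'

DOMAINS = Set.new

# Hypothetical helper: resolve a possibly relative reference against the
# page it was found on and record the domain it points at.
def process_dependency(page, url)
  return if url.nil? || url.strip.empty?
  absolute = URI.join(page.url.to_s, url.strip)
  DOMAINS << absolute.host if absolute.host
rescue URI::InvalidURIError
  # ignore data: URIs, malformed attributes, etc.
end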

Code would be great, but I'm really just after pointers, e.g. I have now discovered that I should use a CSS parser like css_parser to parse out any CSS and find imports and URLs to images.
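
For the CSS step I'm imagining something along these lines (just a regex sketch with a made-up helper name; css_parser would presumably do a more thorough job, including following @import chains):

require 'uri'

# Rough sketch: pull @import targets and url(...) references out of raw CSS
# and resolve them against the stylesheet's own URL.
def css_dependencies(css, base_url)
  refs = css.scan(/@import\s+(?:url\()?["']?([^"'()\s;]+)/i).flatten
  refs += css.scan(/url\(\s*["']?([^"')\s]+)/i).flatten
  refs.map { |ref| URI.join(base_url, ref).to_s rescue nil }.compact.uniq
end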


1 Answer


Get the content of the page, then you can extract an array of URIs from it with

require 'uri'    
URI.extract(page)

After that it's just a matter of using a regular expression to parse each link and extract the domain name.
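
Something like this, for instance (assuming the crawler exposes the raw HTML, e.g. Anemone's page.body; the regex is just illustrative):

require 'uri'

html = page.body                       # raw content of the page
uris = URI.extract(html, %w[http https])

# Pull the domain out of each absolute link
domains = uris.map { |u| u[%r{\Ahttps?://([^/:]+)}, 1] }.compact.uniq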

– eugen

    This is what I would use until I saw that it wasn't sufficient. Then I would use Nokogiri to go after individual tags, and use `extract` to go after anything in a `CDATA` string. – the Tin Man Jul 29 '11 at 23:01