
I'm trying to get all of the domains / IP addresses that a particular page depends on, using Nokogiri. It can't be perfect because of JavaScript dynamically loading dependencies, but I'm happy with a best effort at getting:

  • Image URLs: <img src="...">
  • JavaScript URLs: <script src="...">
  • CSS files and any CSS url(...) references
  • Frames and iframes

I'd also want to follow any CSS imports.

Any suggestions / help would be appreciated. The project is already using Anemone.

Here's what I have at the moment.

require 'anemone'

Anemone.crawl(site, :depth_limit => 1) do |anemone|
  anemone.on_every_page do |page|
    # Image sources
    page.doc.xpath('//img[@src]').each do |node|
      process_dependency(page, node[:src])
    end
    # External scripts (the [@src] predicate skips inline <script> blocks)
    page.doc.xpath('//script[@src]').each do |node|
      process_dependency(page, node[:src])
    end
    # Stylesheets, favicons and anything else pulled in via <link>
    page.doc.xpath('//link[@href]').each do |node|
      process_dependency(page, node[:href])
    end
    puts page.url
  end
end
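
For the frames and iframes in the list above I'm assuming the same pattern will work, and I'm picturing process_dependency (not shown above) as something that resolves each reference against the page URL and records the host, roughly:

# Inside on_every_page, alongside the lookups above
page.doc.xpath('//frame[@src] | //iframe[@src]').each do |node|
  process_dependency(page, node[:src])
end

require 'uri'
require 'set'

DOMAINS = Set.new

# Hypothetical helper: resolve a possibly relative reference against the
# page it was found on and record the domain it points at.
def process_dependency(page, url)
  return if url.nil? || url.strip.empty?
  absolute = URI.join(page.url.to_s, url.strip)
  DOMAINS << absolute.host if absolute.host
rescue URI::InvalidURIError
  # ignore data: URIs, malformed attributes, etc.
end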

Code would be great, but I'm really just after pointers, e.g. I have now discovered that I should use a CSS parser like css_parser to parse out any CSS and find imports and URLs to images.
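
For the CSS step I'm imagining something along these lines (just a regex sketch with a made-up helper name; css_parser would presumably do a more thorough job, including following @import chains):

require 'uri'

# Rough sketch: pull @import targets and url(...) references out of raw CSS
# and resolve them against the stylesheet's own URL.
def css_dependencies(css, base_url)
  refs = css.scan(/@import\s+(?:url\()?["']?([^"'()\s;]+)/i).flatten
  refs += css.scan(/url\(\s*["']?([^"')\s]+)/i).flatten
  refs.map { |ref| URI.join(base_url, ref).to_s rescue nil }.compact.uniq
end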


1 Answer


Get the content of the page, then you can extract an array of URIs from it with

require 'uri'    
URI.extract(page)

After that it's just a matter of using a regular expression to parse each link and extract the domain name.
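
Something like this, for instance (assuming the crawler exposes the raw HTML, e.g. Anemone's page.body; the regex is just illustrative):

require 'uri'

html = page.body                       # raw content of the page
uris = URI.extract(html, %w[http https])

# Pull the domain out of each absolute link
domains = uris.map { |u| u[%r{\Ahttps?://([^/:]+)}, 1] }.compact.uniq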

– eugen

    This is what I would use until I saw that it wasn't sufficient. Then I would use Nokogiri to go after individual tags, and use `extract` to go after anything in a `CDATA` string. – the Tin Man Jul 29 '11 at 23:01