I'm trying to get all of the domains / ip addresses that a particular page depends on using Nokogiri. It can't be perfect because of Javascript dynamically loading dependencies but I'm happy with a best effort at getting:
- Image URLs <img src="..."
- Javascript URLs <script src="..."
- CSS and any CSS url(...) elements
- Frames and IFrames
I'd also want to follow any CSS imports.
Any suggestions / help would be appreciated. The project is already using Anemone.
Here's what I have at the moment.
Anemone.crawl(site, :depth_limit => 1) do |anemone|
anemone.on_every_page do |page|
page.doc.xpath('//img').each do |link|
process_dependency(page, link[:src])
end
page.doc.xpath('//script').each do |link|
process_dependency(page, link[:src])
end
page.doc.xpath('//link').each do |link|
process_dependency(page, link[:href])
end
puts page.url
end
end
Code would be great but I'm really just after pointers e.g. I have now discovered that I should use a css parser like css_parser to parse out any CSS to find imports and URLs to images.