
I'm using Anemone to spider a domain and it works fine.

The code to initiate the crawl looks like this:

require 'anemone'

Anemone.crawl("http://www.example.com/") do |anemone|
  anemone.on_every_page do |page|
    puts page.url
  end
end

This very nicely prints out all the page URLs for the domain, like so:

http://www.example.com/
http://www.example.com/about
http://www.example.com/articles
http://www.example.com/articles/article_01
http://www.example.com/contact

What I would like to do is create an array of key-value pairs, using the last part of the URL for the key and the URL minus the domain for the value.

E.g.

[
   ['','/'],
   ['about','/about'],
   ['articles','/articles'],
   ['article_01','/articles/article_01']
]

Apologies if this is rudimentary stuff, but I'm a Ruby novice.

boldfacedesignuk
  • What you've described as the wanted output doesn't contain any key-value pairs (i.e. hashes). It's all arrays. – Agis Oct 23 '13 at 12:31
  • As stated I'm a Ruby novice so my mark-up illustration and terminology may not be correct. But if you have anything more constructive to offer that'd be lovely. – boldfacedesignuk Oct 23 '13 at 22:28
  • Understanding the difference between `Hash` (key-value pairs) and `Array` (ordered list of objects) is very important, so you might as well call this comment "constructive". – Agis Oct 24 '13 at 06:33
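
To illustrate the distinction raised in that last comment, here is a minimal sketch (not from the original thread) of an array of pairs versus a hash of key-value pairs:

# An Array of two-element Arrays: ordered, accessed by position
pairs_as_array = [['about', '/about'], ['articles', '/articles']]
pairs_as_array[0]          # => ["about", "/about"]

# A Hash: actual key-value pairs, accessed by key
pairs_as_hash = { 'about' => '/about', 'articles' => '/articles' }
pairs_as_hash['about']     # => "/about"

# Converting between the two
Hash[pairs_as_array]       # => {"about"=>"/about", "articles"=>"/articles"}
pairs_as_hash.to_a         # => [["about", "/about"], ["articles", "/articles"]]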

2 Answers


I would define an array or hash first, outside of the block of code, and then add your key-value pairs to it:

require 'anemone'

path_array = []
crawl_url = "http://www.example.com/"    

Anemone.crawl(crawl_url) do |anemone|
  anemone.on_every_page do |page|
    path_array << page.url
    puts page.url
  end
end

From here you can then .map your array into a usable multi-dimensional array:

path_array.map{|x| [x[crawl_url.length..10000], x.gsub("http://www.example.com","")]}

=> [["", "/"], ["about", "/about"], ["articles", "/articles"], ["articles/article_01", "/articles/article_01"], ["contact", "/contact"]] 

I'm not sure if it will work in every scenario, but I think this gives you a good start for collecting the data and manipulating it. Also, if you want key/value pairs, you should look into Ruby's Hash class for more information on how to create and use hashes.
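
For example, here is a hedged sketch of that Hash approach (assuming, as the comments below note, that Anemone yields page.url as a URI object, so URI#path can be used to drop the domain):

require 'anemone'

paths = {}

Anemone.crawl("http://www.example.com/") do |anemone|
  anemone.on_every_page do |page|
    path = page.url.path               # e.g. "/articles/article_01"
    key  = path.split('/').last.to_s   # "article_01"; nil.to_s gives "" for the root "/"
    paths[key] = path                  # keyed on the last path segment
  end
end

# paths => { "" => "/", "about" => "/about", "articles" => "/articles", ... }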

Sean Larkin
  • I gave this a shot and discovered Anemone is maybe using a Hash as I get an error 'undefined method 'gsub' for #' – boldfacedesignuk Oct 23 '13 at 16:32
  • It could be the case that .gsub doesn't work on URI objects (maybe that is what is being returned in the path_array). I wonder if you can convert that object with a method that turns it into a string of the URL; then you can perform gsub. – Sean Larkin Oct 24 '13 at 12:04
  • So I tested this a little more myself, and I am right in saying that it returns URI objects. I think in this case you could simply do path_array.map{|x| [x.host, x.path]} – Sean Larkin Oct 24 '13 at 12:09
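
A hedged one-liner along the lines of those comments (assuming the stored objects are standard URIs, which respond to #to_s and #path):

# Convert each URI to a string so the slice/gsub approach above works as written
path_array.map { |x| [x.to_s[crawl_url.length..-1], x.to_s.gsub("http://www.example.com", "")] }

# Or use the URI accessors directly, taking only the last path segment as the key
path_array.map { |x| [x.path.split('/').last.to_s, x.path] }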

The simplest and possibly least robust way to do this would be to use

page.url.split('/').last

to obtain your 'key'. You would need to test various edge cases to ensure it worked reliably.

Edit: this will return 'www.example.com' as the key for 'http://www.example.com/', which is not the result you require.
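
For illustration, here is a hedged sketch that keeps the split('/').last idea but guards against that root-URL edge case (assuming page.url is a URI, as noted in the comments on the other answer, so it is converted to a string first):

require 'anemone'

pairs = []

Anemone.crawl("http://www.example.com/") do |anemone|
  anemone.on_every_page do |page|
    path = page.url.path                                      # "/", "/about", ...
    key  = path == '/' ? '' : page.url.to_s.split('/').last   # '' for the root URL
    pairs << [key, path]
  end
end

# pairs => [["", "/"], ["about", "/about"], ["articles", "/articles"], ...]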

mcfinnigan