Anemone is a Ruby library that makes it quick and painless to write programs that spider a website. It provides a simple DSL for performing actions on every page of a site, skipping certain URLs, and calculating the shortest path to a given page on a site. The multi-threaded design makes Anemone fast. The API makes it simple. And the expressiveness of Ruby makes it powerful.
Questions tagged [anemone]
38 questions
8 votes · 2 answers
Ruby, Mongodb, Anemone: web crawler with possible memory leak?
I recently began learning about web crawlers and built a sample crawler with Ruby, Anemone, and MongoDB for storage. I'm testing the crawler on a massive public website with possibly billions of links.
The crawler.rb is indexing the correct…

viotech
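
Not a diagnosis of the leak itself, but a minimal sketch of the usual mitigations with Anemone 0.7.x: hand page storage to MongoDB instead of the default in-memory hash, and discard page bodies once each page has been processed. It assumes a MongoDB server on localhost; the URL is a placeholder.

require 'anemone'

Anemone.crawl("http://www.example.com/",
              :discard_page_bodies => true,
              :storage => Anemone::Storage.MongoDB) do |anemone|
  anemone.on_every_page do |page|
    # Do per-page work here; the body is dropped after link extraction,
    # so memory stays roughly flat even on very large crawls.
    puts page.url
  end
end
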
4 votes · 2 answers
Regular expression in Ruby
http://www.example.com/books?_pop=mheader
What would be the regular expression to match this and any other URL that has "books" in it as one of the pattern matches? This site has a books category and various other sub-categories under that. How do…

Aayush
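
One hedged reading of this question: Anemone's on_pages_like takes a regex, and a plain /books/ pattern fires for any URL containing that substring, including the sub-category pages. The seed URL below is a placeholder.

require 'anemone'

Anemone.crawl("http://www.example.com/") do |anemone|
  # Matches http://www.example.com/books?_pop=mheader, /books/fiction, etc.
  anemone.on_pages_like(/books/) do |page|
    puts page.url
  end
end
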
3 votes · 2 answers
Crawling sub-domain with Anemone
I am using Anemone. How do I crawl sub-domains too? For example, if I have the website www.abc.com, my crawler should also crawl support.abc.com or blah.abc.com. I am using Ruby 1.8.7 and Rails 3.

Bhushan Lodha
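
Anemone 0.7.x only follows links whose host exactly matches the seed's host (the in_domain? check in Anemone::Page), so out of the box subdomains are skipped. A sketch of the usual workaround, a monkey-patch loosening that check; abc.com stands in for the real domain:

require 'anemone'

module Anemone
  class Page
    # Loosened same-host check: accept abc.com and any *.abc.com.
    def in_domain?(uri)
      !!(uri.host =~ /(\A|\.)abc\.com\z/)
    end
  end
end

Anemone.crawl("http://www.abc.com/") do |anemone|
  anemone.on_every_page { |page| puts page.url }
end
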
3 votes · 2 answers
Ruby+Anemone Web Crawler: regex to match URLs ending in a series of digits
Suppose I was trying to crawl a website and skip any page that ends like so:
http://HIDDENWEBSITE.com/anonimize/index.php?page=press_and_news&subpage=20060117
I am currently using the Anemone gem in Ruby to build the crawler. I am using the skip_links_like…

sunnyrjuneja
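
A caveat worth knowing before reaching for skip_links_like: in Anemone 0.7.x its patterns are tested against the URI path only, and the digits here sit in the query string. Filtering the full URL in focus_crawl sidesteps that; a sketch:

require 'anemone'

Anemone.crawl("http://HIDDENWEBSITE.com/") do |anemone|
  # Reject any outgoing link whose full URL ends in a run of digits,
  # e.g. ...index.php?page=press_and_news&subpage=20060117
  anemone.focus_crawl do |page|
    page.links.reject { |link| link.to_s =~ /\d+\z/ }
  end
  anemone.on_every_page { |page| puts page.url }
end
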
3 votes · 1 answer
Skipping web-pages with extension pdf, zip from crawling in Anemone
I am developing a crawler using the Anemone gem (Ruby 1.8.7 and Rails 3.1.1). How do I skip web pages with extensions such as pdf, doc, and zip from being crawled/downloaded?

Bhushan Lodha
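
Here skip_links_like is a good fit, since file extensions live in the URI path, which is what Anemone matches the patterns against. A minimal sketch; the seed URL is a placeholder:

require 'anemone'

Anemone.crawl("http://www.example.com/") do |anemone|
  # Never follow links to binary documents.
  anemone.skip_links_like(/\.(pdf|docx?|zip)\z/i)
  anemone.on_every_page { |page| puts page.url }
end
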
3 votes · 1 answer
Ruby Anemone spider adding a tag to each url visited
I have a crawl set up:
require 'anemone'
Anemone.crawl("http://www.website.co.uk", :depth_limit => 1) do |anemone|
anemone.on_every_page do |page|
puts page.url
end
end
However, I want the spider to use a Google Analytics anti-tracking tag on…

Benjamin
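
The question is cut off, but if the goal is to attach a tag to every URL before the spider requests it, Anemone's focus_crawl hook can rewrite outgoing links. The parameter name below is a placeholder, not a real Google Analytics flag:

require 'anemone'

Anemone.crawl("http://www.website.co.uk", :depth_limit => 1) do |anemone|
  anemone.focus_crawl do |page|
    # Append an extra query parameter to each link before following it.
    page.links.map do |link|
      tagged = link.dup
      tagged.query = [tagged.query, "no_track=1"].compact.join("&")
      tagged
    end
  end
  anemone.on_every_page { |page| puts page.url }
end
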
3 votes · 0 answers
Can Anemone keep previously stored pages when recrawling
I just learned about Anemone, the spider framework. Its site says:
Note: Every storage engine will clear out existing Anemone data before beginning a new crawl.
Question: I am wondering if I can avoid this, i.e. keep what has been crawled, and…

lulalala
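
No definitive answer here, but in Anemone 0.7.x the clearing appears to happen in each storage adapter's constructor (the MongoDB adapter empties its collection on initialize), so one heavily hedged workaround is a storage subclass that skips that step:

require 'anemone'
require 'anemone/storage/mongodb'
require 'mongo'

# Assumption: Anemone::Storage::MongoDB#initialize is what wipes the
# collection. This subclass keeps rows from earlier crawls instead,
# accepting that some stored pages may go stale.
class AppendOnlyMongoDB < Anemone::Storage::MongoDB
  def initialize(mongo_db, collection_name = 'pages')
    @db = mongo_db
    @collection = @db[collection_name]   # deliberately no remove here
    @collection.create_index 'url'
  end
end

storage = AppendOnlyMongoDB.new(Mongo::Connection.new.db('anemone'))
Anemone.crawl("http://www.example.com/", :storage => storage) do |anemone|
  anemone.on_every_page { |page| puts page.url }
end
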
3 votes · 1 answer
Error in fetching a list of urls from a website using anemone
Code:
require 'anemone'
Anemone.crawl("http://www.example.com/") do |anemone|
anemone.on_every_page do |page|
puts page.url
end
end
When I try this code I should get a list of all the URLs on that website, but all I get is just the name of…

Anu11
2 votes · 1 answer
anemone ignore url links including a certain phrase
I am running a web scraper with Anemone on Ruby, and it is giving my server some problems when it visits pages that require a logon.
The pages all have a phrase, say "account", in the URL, and I want the program to completely ignore them and not go to any…

Benjamin
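
A sketch of the straightforward fix, with the caveat that Anemone 0.7.x tests these patterns against the URI path, so a phrase appearing only in the query string would slip through:

require 'anemone'

Anemone.crawl("http://www.example.com/") do |anemone|
  # Drop any link whose path contains "account", case-insensitively:
  # /account/login, /my-account, /accounts/42, ...
  anemone.skip_links_like(/account/i)
  anemone.on_every_page { |page| puts page.url }
end
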
2 votes · 1 answer
Getting all the domains a page depends on using Nokogiri
I'm trying to get all of the domains/IP addresses that a particular page depends on, using Nokogiri. It can't be perfect because of JavaScript dynamically loading dependencies, but I'm happy with a best effort at getting:
Image URLs

Jamie McCrindle
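
A best-effort sketch in that spirit: collect the hosts of static dependencies (images, scripts, stylesheets) with Nokogiri. Anything injected by JavaScript is invisible to it, the URL is a placeholder, and URI.open needs Ruby 2.5+.

require 'nokogiri'
require 'open-uri'

page_url = "http://www.example.com/"   # placeholder
doc = Nokogiri::HTML(URI.open(page_url))

# Resolve each src/href against the page URL and keep the unique hosts.
hosts = doc.css("img[src], script[src], link[href]").map do |node|
  begin
    URI.join(page_url, node["src"] || node["href"]).host
  rescue URI::Error
    nil
  end
end.compact.uniq

puts hosts
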
2 votes · 2 answers
Crawl page which requires login with Anemone
I'm using the Anemone gem in the following way:
1. Visit the first URL (seed), save the page content to the database, and save all links from this page to the database as well (all links which are not in the database yet).
2. Load the next link from the database, save its content and any…

kmaci
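
A sketch under two assumptions: the site hands out a session cookie on a plain form login, and Anemone 0.7.x's :cookies / :accept_cookies options forward it on every request. URLs and form field names are placeholders.

require 'net/http'
require 'uri'
require 'anemone'

# Log in once and capture the session cookie from the response.
login = Net::HTTP.post_form(URI("http://www.example.com/login"),
                            "username" => "me", "password" => "secret")
session = login["Set-Cookie"][/\A[^;]+/]   # e.g. "PHPSESSID=abc123"
name, value = session.split("=", 2)

Anemone.crawl("http://www.example.com/",
              :cookies => { name => value },
              :accept_cookies => true) do |anemone|
  anemone.on_every_page { |page| puts page.url }
end
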
2 votes · 1 answer
gems/anemone-0.7.2/lib/anemone/storage.rb:28:in `MongoDB': uninitialized constant Mongo::Connection (NameError)
Using Anemone, I get this error when trying to use MongoDB:
gems/anemone-0.7.2/lib/anemone/storage.rb:28:in `MongoDB': uninitialized constant Mongo::Connection (NameError)
The code looks like this:
require 'anemone'
require…

Snowcrash
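
The likely cause: Anemone 0.7.2's storage adapter targets the pre-2.0 mongo driver API, and Mongo::Connection was removed in mongo 2.0. Pinning the driver in the Gemfile restores the constant:

# Gemfile
gem 'anemone', '~> 0.7.2'
gem 'mongo', '~> 1.12'      # last 1.x line, still defines Mongo::Connection
gem 'bson_ext', '~> 1.12'   # optional C extensions for the 1.x driver
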
2 votes · 3 answers
Prevent fake analytics statistics with custom crawler
Is there a way to prevent fake Google Analytics statistics when using PhantomJS and/or a Ruby crawler like Anemone?
Our monitoring tool (which is based on both of them) crawls our clients' sites and updates the link status of each link in a…

Scribdarock
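
Worth noting: Anemone never executes JavaScript, so on its own it cannot fire the Analytics tag; PhantomJS does execute it. A common mitigation is to crawl with a distinctive User-Agent and exclude that agent on the Analytics side. The Anemone half, with a placeholder agent string:

require 'anemone'

Anemone.crawl("http://client-site.example/",
              :user_agent => "OurLinkMonitor/1.0") do |anemone|
  anemone.on_every_page { |page| puts "#{page.code} #{page.url}" }
end
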
2 votes · 2 answers
Anemone Ruby spider - create key value array without domain name
I'm using Anemone to spider a domain and it works fine.
The code to initiate the crawl looks like this:
require 'anemone'
Anemone.crawl("http://www.example.com/") do |anemone|
anemone.on_every_page do |page|
puts page.url
end
end
This…

boldfacedesignuk
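
A guess at the goal from the visible text: key each crawled page by its URL minus the domain. URI#request_uri (path plus query string) does that; the title lookup is just an example value.

require 'anemone'

pages = {}

Anemone.crawl("http://www.example.com/") do |anemone|
  anemone.on_every_page do |page|
    key = page.url.request_uri   # e.g. "/books?_pop=mheader", no scheme or host
    title = page.doc && page.doc.at("title")
    pages[key] = title ? title.text : nil
  end
end

p pages
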
2 votes · 1 answer
Matching URL structures with Anemone
Right now, I'm doing the following with Anemone:
Anemone.crawl("http://www.findbrowsenodes.com/", :delay => 3) do |anemone|
  anemone.on_every_page do |page|
But I would like to do
Anemone.crawl("http://www.findbrowsenodes.com/", :delay =>…

alexchenco
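
The excerpt is truncated, but a common follow-on to this pattern is restricting the crawl itself, not just the callbacks, to URLs of a certain shape; Anemone's focus_crawl hook decides which of a page's links are followed. The pattern below is a placeholder:

require 'anemone'

Anemone.crawl("http://www.findbrowsenodes.com/", :delay => 3) do |anemone|
  # Only follow links whose URL matches the pattern.
  anemone.focus_crawl do |page|
    page.links.select { |uri| uri.to_s =~ %r{/nodes/} }
  end
  anemone.on_every_page do |page|
    puts page.url
  end
end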