Questions tagged [anemone]

Anemone is a Ruby library that makes it quick and painless to write programs that spider a website. It provides a simple DSL for performing actions on every page of a site, skipping certain URLs, and calculating the shortest path to a given page on a site. The multi-threaded design makes Anemone fast. The API makes it simple. And the expressiveness of Ruby makes it powerful.

http://anemone.rubyforge.org/
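
For orientation, a minimal crawl using this DSL looks like the following sketch (the URL and skip pattern are placeholders):

    require 'anemone'

    Anemone.crawl("http://www.example.com/") do |anemone|
      # Never enqueue links matching a pattern; act on every page visited.
      anemone.skip_links_like(%r{/admin/})
      anemone.on_every_page do |page|
        puts page.url
      end
    end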

38 questions
8 votes · 2 answers

Ruby, MongoDB, Anemone: web crawler with possible memory leak?

I began to learn about web crawlers recently and I built a sample crawler with Ruby, Anemone, and MongoDB for storage. I'm testing the crawler on a massive public website with possibly billions of links. The crawler.rb is indexing the correct…
viotech
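
A common mitigation sketch (not necessarily the asker's exact fix): discard page bodies once their links are extracted, and keep crawl state in MongoDB rather than in the Ruby process. Both options are Anemone features:

    require 'anemone'
    require 'mongo'

    Anemone.crawl("http://www.example.com/",
                  :discard_page_bodies => true,            # drop HTML after link extraction
                  :storage => Anemone::Storage.MongoDB) do |anemone|
      anemone.on_every_page do |page|
        puts page.url
      end
    end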
4 votes · 2 answers

Regular expression in Ruby

http://www.example.com/books?_pop=mheader What would be the regular expression to match this and any URL that has "books" in it as one of the pattern matches? This site has a books category and various other sub-categories under that. How do…
Aayush
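
One way to do this inside Anemone itself, as a sketch, is on_pages_like with a pattern matching "books" anywhere in the URL (URL and pattern are illustrative):

    require 'anemone'

    Anemone.crawl("http://www.example.com/") do |anemone|
      # The block fires only for pages whose URL matches the pattern;
      # the crawler still discovers links on other pages.
      anemone.on_pages_like(%r{/books}) do |page|
        puts page.url
      end
    end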
3 votes · 2 answers

Crawling sub-domain with Anemone

I am using Anemone. How do I crawl sub-domains too? E.g. if I have the website www.abc.com, my crawler should also crawl support.abc.com or blah.abc.com. I am using Ruby 1.8.7 and Rails 3.
Bhushan Lodha
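
Anemone keeps each page's links to that page's own host, so one workaround (a sketch, not the only approach) is to seed every subdomain explicitly; following subdomains discovered mid-crawl would instead require patching Anemone::Page#in_domain?:

    require 'anemone'

    # Each seed is crawled within its own host.
    seeds = %w[http://www.abc.com http://support.abc.com http://blah.abc.com]

    Anemone.crawl(seeds) do |anemone|
      anemone.on_every_page do |page|
        puts page.url
      end
    end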
3 votes · 2 answers

Ruby+Anemone Web Crawler: regex to match URLs ending in a series of digits

Suppose I was trying to crawl a website and skip a page that ended like so: http://HIDDENWEBSITE.com/anonimize/index.php?page=press_and_news&subpage=20060117 I am currently using the Anemone gem in Ruby to build the crawler. I am using the skip_links_like…
sunnyrjuneja
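
A sketch of skip_links_like for this case, keyed to the subpage parameter from the example URL:

    require 'anemone'

    Anemone.crawl("http://HIDDENWEBSITE.com/") do |anemone|
      # Skip any link whose subpage parameter ends in a run of digits.
      anemone.skip_links_like(/subpage=\d+$/)
      anemone.on_every_page do |page|
        puts page.url
      end
    end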
3 votes · 1 answer

Skipping web pages with extensions pdf, zip from crawling in Anemone

I am developing a crawler using the Anemone gem (Ruby 1.8.7 and Rails 3.1.1). How can I skip web pages with extensions pdf, doc, zip, etc. from being crawled/downloaded?
Bhushan Lodha
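
A minimal sketch using skip_links_like with an extension pattern:

    require 'anemone'

    Anemone.crawl("http://www.example.com/") do |anemone|
      # Skip links whose path ends in a binary-document extension.
      anemone.skip_links_like(/\.(pdf|doc|zip)$/i)
      anemone.on_every_page do |page|
        puts page.url
      end
    end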
3 votes · 1 answer

Ruby Anemone spider adding a tag to each url visited

I have a crawl set up:

    require 'anemone'
    Anemone.crawl("http://www.website.co.uk", :depth_limit => 1) do |anemone|
      anemone.on_every_page do |page|
        puts page.url
      end
    end

However I want the spider to use a Google Analytics anti-tracking tag on…
Benjamin
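
One way to tag every URL the spider fetches (a sketch; "crawler=1" is a made-up marker name, and note this changes the URLs actually requested) is to rewrite links in focus_crawl, which decides what gets followed:

    require 'anemone'

    Anemone.crawl("http://www.website.co.uk", :depth_limit => 1) do |anemone|
      # Rewrite every outgoing link to carry a marker query parameter
      # that an analytics filter could then exclude.
      anemone.focus_crawl do |page|
        page.links.map do |link|
          tagged = link.dup
          tagged.query = [tagged.query, 'crawler=1'].compact.join('&')
          tagged
        end
      end
      anemone.on_every_page { |page| puts page.url }
    end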
3 votes · 0 answers

Can Anemone keep previously stored pages when recrawling

I just learned about Anemone, the spider framework. Its site says: "Note: Every storage engine will clear out existing Anemone data before beginning a new crawl." Question: I am wondering if I can avoid this, i.e. keep what has been crawled, and…
lulalala
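
There is no built-in switch for this, so one workaround sketch, assuming MongoDB storage and the 1.x mongo driver, is to copy the old pages collection aside before the new crawl clears it (the collection names are assumptions):

    require 'mongo'
    require 'anemone'

    db = Mongo::Connection.new.db('crawler')

    # Anemone clears its 'pages' collection when a new crawl starts,
    # so copy the previous crawl into a timestamped collection first.
    backup = db["pages_#{Time.now.to_i}"]
    db['pages'].find.each { |doc| backup.insert(doc) }

    Anemone.crawl("http://www.example.com/",
                  :storage => Anemone::Storage.MongoDB(db)) do |anemone|
      anemone.on_every_page { |page| puts page.url }
    end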
3 votes · 1 answer

Error in fetching a list of URLs from a website using Anemone

Code:

    require 'anemone'
    Anemone.crawl("http://www.example.com/") do |anemone|
      anemone.on_every_page do |page|
        puts page.url
      end
    end

When I try this code I should get a list of all the URLs on that website, but all I get is just the name of…
Anu11
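
A debugging sketch: print the HTTP status and same-host link count per page, to tell a failed fetch apart from a page with no crawlable links (Anemone only follows links on each page's own host):

    require 'anemone'

    Anemone.crawl("http://www.example.com/") do |anemone|
      anemone.on_every_page do |page|
        # If only the seed URL prints, check whether the fetch failed
        # (non-200 code) or the page simply has no same-host links.
        puts "#{page.url} code=#{page.code} links=#{page.links.size}"
      end
    end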
2 votes · 1 answer

Anemone: ignore URL links including a certain phrase

I am running a web scraper with Anemone on Ruby, and it is giving my server some problems when it visits pages that require a logon. The pages all have a phrase, say "account", in the URL, and I want the program to completely ignore and not go to any…
Benjamin
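
A minimal sketch using skip_links_like, so any URL containing "account" is never visited:

    require 'anemone'

    Anemone.crawl("http://www.example.com/") do |anemone|
      # Links matching this pattern are never enqueued.
      anemone.skip_links_like(/account/)
      anemone.on_every_page do |page|
        puts page.url
      end
    end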
2 votes · 1 answer

Getting all the domains a page depends on using Nokogiri

I'm trying to get all of the domains / IP addresses that a particular page depends on, using Nokogiri. It can't be perfect because of JavaScript dynamically loading dependencies, but I'm happy with a best effort at getting: Image URLs…
Jamie McCrindle
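
A best-effort sketch with Nokogiri and open-uri, collecting the host of every src/href attribute (the attribute list is an assumption about what counts as a dependency):

    require 'nokogiri'
    require 'open-uri'
    require 'uri'

    url = 'http://www.example.com/'
    doc = Nokogiri::HTML(URI.parse(url).read)

    # Resolve each referenced URL against the page and keep its host.
    hosts = doc.xpath('//img/@src | //script/@src | //link/@href | //a/@href').map do |attr|
      begin
        URI.join(url, attr.value.strip).host
      rescue URI::Error
        nil   # skip unparseable or non-HTTP references
      end
    end.compact.uniq

    puts hosts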
2 votes · 2 answers

Crawl page which requires login with Anemone

I'm using the Anemone gem in the following way:

1. Visit the first URL (seed), save the page content to the database, and save all links from this page to the database as well (all links which are not in the database yet).
2. Load the next link from the database, save its content and any…
kmaci
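
Anemone can send cookies via its :cookies option, so one sketch is to log in with Net::HTTP first and pass the session cookie along (the form fields and the '_session_id' cookie name are assumptions):

    require 'net/http'
    require 'uri'
    require 'anemone'

    # Log in with a plain POST, then extract the session cookie.
    login = URI.parse('http://www.example.com/login')
    res = Net::HTTP.post_form(login, 'username' => 'me', 'password' => 'secret')
    session = res['Set-Cookie'][/_session_id=([^;]+)/, 1]

    Anemone.crawl('http://www.example.com/',
                  :accept_cookies => true,
                  :cookies => { '_session_id' => session }) do |anemone|
      anemone.on_every_page { |page| puts page.url }
    end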
2 votes · 1 answer

gems/anemone-0.7.2/lib/anemone/storage.rb:28:in `MongoDB': uninitialized constant Mongo::Connection (NameError)

Using Anemone, I get this error when trying to use MongoDB:

    gems/anemone-0.7.2/lib/anemone/storage.rb:28:in `MongoDB': uninitialized constant Mongo::Connection (NameError)

The code looks like this: require 'anemone' require…
Snowcrash
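
Anemone 0.7.2 was written against the 1.x mongo driver, which still defined Mongo::Connection; the 2.x driver removed that class. A sketch of the usual fix, assuming Bundler:

    # Gemfile
    source 'https://rubygems.org'

    gem 'anemone'
    gem 'mongo', '~> 1.12'   # the 1.x series still defines Mongo::Connection
    gem 'bson_ext'           # native extensions used by the 1.x driver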
2 votes · 3 answers

Prevent fake analytics statistics with custom crawler

Is there a way to prevent faked Google Analytics statistics caused by PhantomJS and/or a Ruby crawler like Anemone? Our monitoring tool (which is based on both of them) crawls our clients' sites and updates the link status of each link in a…
Scribdarock
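
Anemone itself does not execute JavaScript, so it should not trigger Google Analytics at all; PhantomJS does. A common mitigation (a sketch; the user-agent string is arbitrary) is to crawl with a distinctive user agent and exclude it with a GA filter:

    require 'anemone'

    # Identify the crawler explicitly so an exclusion filter in
    # Google Analytics can drop its visits.
    Anemone.crawl('http://www.example.com/',
                  :user_agent => 'OurMonitoringBot/1.0') do |anemone|
      anemone.on_every_page do |page|
        puts "#{page.url} => #{page.code}"
      end
    end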
2 votes · 2 answers

Anemone Ruby spider - create key value array without domain name

I'm using Anemone to spider a domain and it works fine. The code to initiate the crawl looks like this:

    require 'anemone'
    Anemone.crawl("http://www.example.com/") do |anemone|
      anemone.on_every_page do |page|
        puts page.url
      end
    end

This…
boldfacedesignuk
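
Since page.url is a URI object, its .path gives the key without the domain. A sketch that builds a path => title hash (using the page <title> as the value is an assumption):

    require 'anemone'

    pages = {}

    Anemone.crawl('http://www.example.com/') do |anemone|
      anemone.on_every_page do |page|
        # page.doc is the parsed Nokogiri document (nil for non-HTML).
        title = page.doc && page.doc.at('title')
        pages[page.url.path] = title && title.text.strip
      end
    end

    p pages   # e.g. {"/" => "Home", "/about" => "About Us"}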
2 votes · 1 answer

Matching URL structures with Anemone

Right now, I'm doing the following with Anemone:

    Anemone.crawl("http://www.findbrowsenodes.com/", :delay => 3) do |anemone|
      anemone.on_every_page do | page |

But I would like to do:

    Anemone.crawl("http://www.findbrowsenodes.com/", :delay =>…
alexchenco
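
A sketch using on_pages_like, which fires only for URLs matching a pattern (the pattern here is illustrative):

    require 'anemone'

    Anemone.crawl('http://www.findbrowsenodes.com/', :delay => 3) do |anemone|
      # Unlike on_every_page, this block runs only on matching URLs.
      anemone.on_pages_like(%r{/nodes/}) do |page|
        puts page.url
      end
    end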