Questions tagged [anemone]

Anemone is a Ruby library that makes it quick and painless to write programs that spider a website. It provides a simple DSL for performing actions on every page of a site, skipping certain URLs, and calculating the shortest path to a given page on a site. The multi-threaded design makes Anemone fast. The API makes it simple. And the expressiveness of Ruby makes it powerful.

http://anemone.rubyforge.org/
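
For orientation, a minimal crawl using this DSL looks like the following sketch (the URL and skip pattern are placeholders):

    require 'anemone'

    Anemone.crawl("http://www.example.com/") do |anemone|
      # Never enqueue links matching a pattern; act on every page visited.
      anemone.skip_links_like(%r{/admin/})
      anemone.on_every_page do |page|
        puts page.url
      end
    end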

38 questions
8 votes · 2 answers

Ruby, MongoDB, Anemone: web crawler with possible memory leak?

I began to learn about web crawlers recently and I built a sample crawler with Ruby, Anemone, and MongoDB for storage. I'm testing the crawler on a massive public website with possibly billions of links. The crawler.rb is indexing the correct…
viotech
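
A common mitigation sketch (not necessarily the asker's exact fix): discard page bodies once their links are extracted, and keep crawl state in MongoDB rather than in the Ruby process. Both options are Anemone features:

    require 'anemone'
    require 'mongo'

    Anemone.crawl("http://www.example.com/",
                  :discard_page_bodies => true,            # drop HTML after link extraction
                  :storage => Anemone::Storage.MongoDB) do |anemone|
      anemone.on_every_page do |page|
        puts page.url
      end
    end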
4 votes · 2 answers

Regular expression in Ruby

http://www.example.com/books?_pop=mheader What would be the regular expression to match this and any URL that has "books" in it as one of the pattern matches? This site has a books category and various other sub-categories under that. How do…
Aayush
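
One way to do this inside Anemone itself, as a sketch, is on_pages_like with a pattern matching "books" anywhere in the URL (URL and pattern are illustrative):

    require 'anemone'

    Anemone.crawl("http://www.example.com/") do |anemone|
      # The block fires only for pages whose URL matches the pattern;
      # the crawler still discovers links on other pages.
      anemone.on_pages_like(%r{/books}) do |page|
        puts page.url
      end
    end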
3 votes · 2 answers

Crawling sub-domain with Anemone

I am using Anemone. How do I crawl sub-domains too? E.g. if I have the website www.abc.com, my crawler should also crawl support.abc.com or blah.abc.com. I am using Ruby 1.8.7 and Rails 3.
Bhushan Lodha
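
Anemone keeps each page's links to that page's own host, so one workaround (a sketch, not the only approach) is to seed every subdomain explicitly; following subdomains discovered mid-crawl would instead require patching Anemone::Page#in_domain?:

    require 'anemone'

    # Each seed is crawled within its own host.
    seeds = %w[http://www.abc.com http://support.abc.com http://blah.abc.com]

    Anemone.crawl(seeds) do |anemone|
      anemone.on_every_page do |page|
        puts page.url
      end
    end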
3 votes · 2 answers

Ruby+Anemone Web Crawler: regex to match URLs ending in a series of digits

Suppose I was trying to crawl a website and skip a page that ended like so: http://HIDDENWEBSITE.com/anonimize/index.php?page=press_and_news&subpage=20060117 I am currently using the Anemone gem in Ruby to build the crawler. I am using the skip_links_like…
sunnyrjuneja
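
A sketch of skip_links_like for this case, keyed to the subpage parameter from the example URL:

    require 'anemone'

    Anemone.crawl("http://HIDDENWEBSITE.com/") do |anemone|
      # Skip any link whose subpage parameter ends in a run of digits.
      anemone.skip_links_like(/subpage=\d+$/)
      anemone.on_every_page do |page|
        puts page.url
      end
    end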
3 votes · 1 answer

Skipping web pages with extensions pdf, zip from crawling in Anemone

I am developing a crawler using the Anemone gem (Ruby 1.8.7 and Rails 3.1.1). How can I skip web pages with extensions pdf, doc, zip, etc. from being crawled/downloaded?
Bhushan Lodha
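
A minimal sketch using skip_links_like with an extension pattern:

    require 'anemone'

    Anemone.crawl("http://www.example.com/") do |anemone|
      # Skip links whose path ends in a binary-document extension.
      anemone.skip_links_like(/\.(pdf|doc|zip)$/i)
      anemone.on_every_page do |page|
        puts page.url
      end
    end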
3 votes · 1 answer

Ruby Anemone spider adding a tag to each url visited

I have a crawl set up:

    require 'anemone'
    Anemone.crawl("http://www.website.co.uk", :depth_limit => 1) do |anemone|
      anemone.on_every_page do |page|
        puts page.url
      end
    end

However I want the spider to use a Google Analytics anti-tracking tag on…
Benjamin
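
One way to tag every URL the spider fetches (a sketch; "crawler=1" is a made-up marker name, and note this changes the URLs actually requested) is to rewrite links in focus_crawl, which decides what gets followed:

    require 'anemone'

    Anemone.crawl("http://www.website.co.uk", :depth_limit => 1) do |anemone|
      # Rewrite every outgoing link to carry a marker query parameter
      # that an analytics filter could then exclude.
      anemone.focus_crawl do |page|
        page.links.map do |link|
          tagged = link.dup
          tagged.query = [tagged.query, 'crawler=1'].compact.join('&')
          tagged
        end
      end
      anemone.on_every_page { |page| puts page.url }
    end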
3 votes · 0 answers

Can Anemone keep previously stored pages when recrawling

I just learned about Anemone, the spider framework. Its site says: "Note: Every storage engine will clear out existing Anemone data before beginning a new crawl." Question: I am wondering if I can avoid this, i.e. keep what has been crawled, and…
lulalala
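
There is no built-in switch for this, so one workaround sketch, assuming MongoDB storage and the 1.x mongo driver, is to copy the old pages collection aside before the new crawl clears it (the collection names are assumptions):

    require 'mongo'
    require 'anemone'

    db = Mongo::Connection.new.db('crawler')

    # Anemone clears its 'pages' collection when a new crawl starts,
    # so copy the previous crawl into a timestamped collection first.
    backup = db["pages_#{Time.now.to_i}"]
    db['pages'].find.each { |doc| backup.insert(doc) }

    Anemone.crawl("http://www.example.com/",
                  :storage => Anemone::Storage.MongoDB(db)) do |anemone|
      anemone.on_every_page { |page| puts page.url }
    end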
3 votes · 1 answer

Error in fetching a list of URLs from a website using Anemone

Code:

    require 'anemone'
    Anemone.crawl("http://www.example.com/") do |anemone|
      anemone.on_every_page do |page|
        puts page.url
      end
    end

When I try this code I should get a list of all the URLs on that website, but all I get is just the name of…
Anu11
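
A debugging sketch: print the HTTP status and same-host link count per page, to tell a failed fetch apart from a page with no crawlable links (Anemone only follows links on each page's own host):

    require 'anemone'

    Anemone.crawl("http://www.example.com/") do |anemone|
      anemone.on_every_page do |page|
        # If only the seed URL prints, check whether the fetch failed
        # (non-200 code) or the page simply has no same-host links.
        puts "#{page.url} code=#{page.code} links=#{page.links.size}"
      end
    end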
2 votes · 1 answer

Anemone: ignore URL links including a certain phrase

I am running a web scraper with Anemone on Ruby, and it is giving my server some problems when it visits pages that require a logon. The pages all have a phrase, say "account", in the URL, and I want the program to completely ignore and not go to any…
Benjamin
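
A minimal sketch using skip_links_like, so any URL containing "account" is never visited:

    require 'anemone'

    Anemone.crawl("http://www.example.com/") do |anemone|
      # Links matching this pattern are never enqueued.
      anemone.skip_links_like(/account/)
      anemone.on_every_page do |page|
        puts page.url
      end
    end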
2 votes · 1 answer

Getting all the domains a page depends on using Nokogiri

I'm trying to get all of the domains / IP addresses that a particular page depends on, using Nokogiri. It can't be perfect because of JavaScript dynamically loading dependencies, but I'm happy with a best effort at getting: Image URLs…
Jamie McCrindle
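
A best-effort sketch with Nokogiri and open-uri, collecting the host of every src/href attribute (the attribute list is an assumption about what counts as a dependency):

    require 'nokogiri'
    require 'open-uri'
    require 'uri'

    url = 'http://www.example.com/'
    doc = Nokogiri::HTML(URI.parse(url).read)

    # Resolve each referenced URL against the page and keep its host.
    hosts = doc.xpath('//img/@src | //script/@src | //link/@href | //a/@href').map do |attr|
      begin
        URI.join(url, attr.value.strip).host
      rescue URI::Error
        nil   # skip unparseable or non-HTTP references
      end
    end.compact.uniq

    puts hosts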
2 votes · 2 answers

Crawl page which requires login with Anemone

I'm using the Anemone gem in the following way:

1. Visit the first URL (seed), save the page content to the database, and save all links from this page to the database as well (all links which are not in the database yet).
2. Load the next link from the database, save its content and any…
kmaci
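
Anemone can send cookies via its :cookies option, so one sketch is to log in with Net::HTTP first and pass the session cookie along (the form fields and the '_session_id' cookie name are assumptions):

    require 'net/http'
    require 'uri'
    require 'anemone'

    # Log in with a plain POST, then extract the session cookie.
    login = URI.parse('http://www.example.com/login')
    res = Net::HTTP.post_form(login, 'username' => 'me', 'password' => 'secret')
    session = res['Set-Cookie'][/_session_id=([^;]+)/, 1]

    Anemone.crawl('http://www.example.com/',
                  :accept_cookies => true,
                  :cookies => { '_session_id' => session }) do |anemone|
      anemone.on_every_page { |page| puts page.url }
    end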
2 votes · 1 answer

gems/anemone-0.7.2/lib/anemone/storage.rb:28:in `MongoDB': uninitialized constant Mongo::Connection (NameError)

Using Anemone, I get this error when trying to use MongoDB:

    gems/anemone-0.7.2/lib/anemone/storage.rb:28:in `MongoDB': uninitialized constant Mongo::Connection (NameError)

The code looks like this: require 'anemone' require…
Snowcrash
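
Anemone 0.7.2 was written against the 1.x mongo driver, which still defined Mongo::Connection; the 2.x driver removed that class. A sketch of the usual fix, assuming Bundler:

    # Gemfile
    source 'https://rubygems.org'

    gem 'anemone'
    gem 'mongo', '~> 1.12'   # the 1.x series still defines Mongo::Connection
    gem 'bson_ext'           # native extensions used by the 1.x driver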
2 votes · 3 answers

Prevent fake analytics statistics with custom crawler

Is there a way to prevent faked Google Analytics statistics caused by PhantomJS and/or a Ruby crawler like Anemone? Our monitoring tool (which is based on both of them) crawls our clients' sites and updates the link status of each link in a…
Scribdarock
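
Anemone itself does not execute JavaScript, so it should not trigger Google Analytics at all; PhantomJS does. A common mitigation (a sketch; the user-agent string is arbitrary) is to crawl with a distinctive user agent and exclude it with a GA filter:

    require 'anemone'

    # Identify the crawler explicitly so an exclusion filter in
    # Google Analytics can drop its visits.
    Anemone.crawl('http://www.example.com/',
                  :user_agent => 'OurMonitoringBot/1.0') do |anemone|
      anemone.on_every_page do |page|
        puts "#{page.url} => #{page.code}"
      end
    end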
2 votes · 2 answers

Anemone Ruby spider - create key value array without domain name

I'm using Anemone to spider a domain and it works fine. The code to initiate the crawl looks like this:

    require 'anemone'
    Anemone.crawl("http://www.example.com/") do |anemone|
      anemone.on_every_page do |page|
        puts page.url
      end
    end

This…
boldfacedesignuk
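
Since page.url is a URI object, its .path gives the key without the domain. A sketch that builds a path => title hash (using the page <title> as the value is an assumption):

    require 'anemone'

    pages = {}

    Anemone.crawl('http://www.example.com/') do |anemone|
      anemone.on_every_page do |page|
        # page.doc is the parsed Nokogiri document (nil for non-HTML).
        title = page.doc && page.doc.at('title')
        pages[page.url.path] = title && title.text.strip
      end
    end

    p pages   # e.g. {"/" => "Home", "/about" => "About Us"}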
2 votes · 1 answer

Matching URL structures with Anemone

Right now, I'm doing the following with Anemone:

    Anemone.crawl("http://www.findbrowsenodes.com/", :delay => 3) do |anemone|
      anemone.on_every_page do | page |

But I would like to do:

    Anemone.crawl("http://www.findbrowsenodes.com/", :delay =>…
alexchenco
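
A sketch using on_pages_like, which fires only for URLs matching a pattern (the pattern here is illustrative):

    require 'anemone'

    Anemone.crawl('http://www.findbrowsenodes.com/', :delay => 3) do |anemone|
      # Unlike on_every_page, this block runs only on matching URLs.
      anemone.on_pages_like(%r{/nodes/}) do |page|
        puts page.url
      end
    end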