What is your recommendation for writing a web crawler in Ruby? Is there any library better than Mechanize?
- Mechanize is a great tool if you need to navigate a website, fill in forms, authenticate, etc. It isn't a spider, because you have to tell it how to do everything (a minimal sketch follows these comments). I haven't tried Anemone, but its features look good. Whatever you do, make sure you honor the `robots.txt` file on the site you are running against, or throttle your code back; ill-behaved spiders can get you banned. Writing a spider isn't that hard; I've written more than I can remember. Writing one that is a good citizen and is robust is a bigger task, so go with a pre-built wheel if you can. – the Tin Man Nov 21 '11 at 14:57
- I'd recommend looking at "[What are some good Ruby-based web crawlers?](http://stackoverflow.com/questions/4981379/what-are-some-good-ruby-based-web-crawlers/4981595)". – the Tin Man Jan 08 '15 at 18:04
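A minimal Mechanize sketch along the lines of the first comment, assuming a hypothetical login form and placeholder URLs and field names:

```ruby
require 'mechanize'

agent = Mechanize.new
agent.robots = true                  # honor robots.txt, as the comment advises
agent.user_agent_alias = 'Mac Safari'

# Fetch a page, fill in a (hypothetical) login form, and submit it
page = agent.get('https://example.com/login')
form = page.form_with(action: '/sessions')   # placeholder form selector
form['username'] = 'me'
form['password'] = 'secret'
dashboard = agent.submit(form)

# Follow every link on the resulting page, pausing between requests
dashboard.links.each do |link|
  sleep 1                            # crude throttling to stay a good citizen
  puts agent.click(link).uri
end
```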
5 Answers
I'd give Anemone a try. It's simple to use, especially if you have to write a simple crawler, and in my opinion it is well designed too. For example, I wrote a Ruby script to search for 404 errors on my sites in a very short time.
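A rough sketch of what such a 404 checker might look like with Anemone (the site URL is a placeholder, not the answerer's actual script):

```ruby
require 'anemone'

# Crawl the site and print every page that comes back with a 404
Anemone.crawl('https://example.com', delay: 1) do |anemone|
  anemone.on_every_page do |page|
    puts "404: #{page.url}" if page.code == 404
  end
end
```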

- You should post a gist of this, as I will be implementing the same functionality soon. Others would probably use it as well. – cha55son Sep 18 '13 at 21:23
If you just want to get the pages' content, the simplest way is to use the open-uri functions. They don't require additional gems; you just have to `require 'open-uri'`. See http://ruby-doc.org/stdlib-2.2.2/libdoc/open-uri/rdoc/OpenURI.html

To parse the content you can use Nokogiri or other gems, which also offer useful XPath support, for example. You can find other parsing libraries right here on SO.
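A minimal sketch combining the two, assuming a placeholder URL (on Rubies older than 2.5 you would call `open` instead of `URI.open`):

```ruby
require 'open-uri'
require 'nokogiri'

# Fetch a page with open-uri and parse it with Nokogiri
html = URI.open('https://example.com').read
doc  = Nokogiri::HTML(html)

# Use XPath to pull out every link on the page
doc.xpath('//a/@href').each { |href| puts href.value }
```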
You might want to check out Wombat, which is built on top of Mechanize/Nokogiri and provides a DSL (like Sinatra, for example) for parsing pages. Pretty neat :)
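A rough sketch of that DSL, loosely based on Wombat's README; the URL and property names are only illustrative, so check the project's documentation for the exact syntax:

```ruby
require 'wombat'

# Declarative scraping: each property names a piece of data to extract
result = Wombat.crawl do
  base_url "https://example.com"
  path "/"

  page_title xpath: "//title"
  first_heading css: "h1"
end

puts result.inspect  # => { "page_title" => "...", "first_heading" => "..." }
```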

I am working on the pioneer gem, which is not a spider but a simple asynchronous crawler based on the em-synchrony gem.

- English is not my native language, so I may be wrong, but it seems to me that a crawler is something more general than a spider. A spider is a complete piece of software: it recursively follows links. Pioneer is more like a little framework: you could write your own spider with it, and you can do more ;). You have to do more of the work manually to use Pioneer, but it is more flexible. – fl00r Jun 21 '12 at 18:42
- According to [Wikipedia](http://en.wikipedia.org/wiki/Web_crawler): "Other terms for Web crawlers are ants, automatic indexers, bots, Web spiders, Web robots..." This matches up with the [StackOverflow synonyms for web crawler](http://stackoverflow.com/tags/web-crawler/synonyms). – David J. Jun 21 '12 at 19:38
I just released one recently called Klepto. It's got a pretty simple DSL, is built on top of Capybara, and has a lot of cool configuration options.

- It would be nice if you could expand your answer and explain more about these cool options and why your library is better for the task. Also, be careful when posting links to your own projects; the community can view it as a bit spammy. – Kev Apr 19 '13 at 02:32