What is your recommendation for writing a web crawler in Ruby? Is there any library better than Mechanize?
- Mechanize is a great tool if you need to navigate a website, fill in forms, authenticate, etc. It isn't a spider, because you have to tell it how to do everything (a minimal sketch follows these comments). I haven't tried Anemone, but its features look good. Whatever you do, make sure you honor the `robots.txt` file on the site you are running against, or throttle your code back; ill-behaved spiders can get you banned. Writing a spider isn't that hard; I've written more than I can remember. Writing one that is a good citizen and is robust is a bigger task, so go with a pre-built wheel if you can. – the Tin Man Nov 21 '11 at 14:57
- I'd recommend looking at "[What are some good Ruby-based web crawlers?](http://stackoverflow.com/questions/4981379/what-are-some-good-ruby-based-web-crawlers/4981595)". – the Tin Man Jan 08 '15 at 18:04
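A minimal Mechanize sketch along the lines of the first comment, assuming a hypothetical login form and placeholder URLs and field names:

```ruby
require 'mechanize'

agent = Mechanize.new
agent.robots = true                  # honor robots.txt, as the comment advises
agent.user_agent_alias = 'Mac Safari'

# Fetch a page, fill in a (hypothetical) login form, and submit it
page = agent.get('https://example.com/login')
form = page.form_with(action: '/sessions')   # placeholder form selector
form['username'] = 'me'
form['password'] = 'secret'
dashboard = agent.submit(form)

# Follow every link on the resulting page, pausing between requests
dashboard.links.each do |link|
  sleep 1                            # crude throttling to stay a good citizen
  puts agent.click(link).uri
end
```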
5 Answers
I'd give Anemone a try. It's simple to use, especially if you have to write a simple crawler, and in my opinion it is well designed too. For example, I wrote a Ruby script to search for 404 errors on my sites in a very short time.
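A rough sketch of what such a 404 checker might look like with Anemone (the site URL is a placeholder, not the answerer's actual script):

```ruby
require 'anemone'

# Crawl the site and print every page that comes back with a 404
Anemone.crawl('https://example.com', delay: 1) do |anemone|
  anemone.on_every_page do |page|
    puts "404: #{page.url}" if page.code == 404
  end
end
```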

- You should post a gist of this, as I will be implementing the same functionality soon. Others would probably use it as well. – cha55son Sep 18 '13 at 21:23
If you just want to get the pages' content, the simplest way is to use the open-uri functions. They don't require additional gems; you just have to `require 'open-uri'`. See http://ruby-doc.org/stdlib-2.2.2/libdoc/open-uri/rdoc/OpenURI.html

To parse the content you can use Nokogiri or other gems, which also offer useful XPath support, for example. You can find other parsing libraries right here on SO.
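A minimal sketch combining the two, assuming a placeholder URL (on Rubies older than 2.5 you would call `open` instead of `URI.open`):

```ruby
require 'open-uri'
require 'nokogiri'

# Fetch a page with open-uri and parse it with Nokogiri
html = URI.open('https://example.com').read
doc  = Nokogiri::HTML(html)

# Use XPath to pull out every link on the page
doc.xpath('//a/@href').each { |href| puts href.value }
```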
You might want to check out Wombat, which is built on top of Mechanize/Nokogiri and provides a DSL (like Sinatra, for example) for parsing pages. Pretty neat :)
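A rough sketch of that DSL, loosely based on Wombat's README; the URL and property names are only illustrative, so check the project's documentation for the exact syntax:

```ruby
require 'wombat'

# Declarative scraping: each property names a piece of data to extract
result = Wombat.crawl do
  base_url "https://example.com"
  path "/"

  page_title xpath: "//title"
  first_heading css: "h1"
end

puts result.inspect  # => { "page_title" => "...", "first_heading" => "..." }
```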

I am working on the pioneer gem, which is not a spider but a simple asynchronous crawler based on the em-synchrony gem.

- English is not my native language, so I may be wrong, but it seems to me that a crawler is something more general than a spider. A spider is a complete piece of software: it recursively follows links. Pioneer is more like a little framework: you could write your own spider with it, and you can do more ;). You have to do more of the work manually to use Pioneer, but it is more flexible. – fl00r Jun 21 '12 at 18:42
- According to [Wikipedia](http://en.wikipedia.org/wiki/Web_crawler): "Other terms for Web crawlers are ants, automatic indexers, bots, Web spiders, Web robots..." This matches up with the [StackOverflow synonyms for web crawler](http://stackoverflow.com/tags/web-crawler/synonyms). – David J. Jun 21 '12 at 19:38
I just released one recently called Klepto. It's got a pretty simple DSL, is built on top of Capybara, and has a lot of cool configuration options.

- It would be nice if you could expand your answer and explain more about these cool options and why your library is better for the task. Also, be careful when posting links to your own projects; the community can view it as a bit spammy. – Kev Apr 19 '13 at 02:32