
The site I want to index is fairly big: 1.x million pages. I really just want a JSON file of all the URLs so I can run some operations on them (sorting, grouping, etc.).

The basic Anemone loop worked well:

require 'anemone'

Anemone.crawl("http://www.example.com/") do |anemone|
  anemone.on_every_page do |page|
    puts page.url
  end
end

But (because of the site size?) the terminal froze after a while. Therefore, I installed MongoDB and used the following:

require 'rubygems'
require 'anemone'
require 'mongo'
require 'json'

# Redirect stdout so every puts below lands in sitemap.json (one URL per line)
$stdout = File.new('sitemap.json', 'w')

Anemone.crawl("http://www.mybigexamplesite.com/") do |anemone|
  anemone.storage = Anemone::Storage.MongoDB
  anemone.on_every_page do |page|
    puts page.url
  end
end

It's running now, but I'll be very surprised if there's output in the JSON file when I get back in the morning - I've never used MongoDB before, and the part of the Anemone docs about using storage wasn't clear (to me at least). Can anyone who's done this before give me some tips?
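
For clarity, what I'm ultimately after is a single JSON array of URLs rather than one URL per line. Roughly this shape, though it's only a sketch (it assumes MongoDB is running locally on the default port, and I haven't confirmed whether :storage can be passed as a crawl option instead of being set inside the block):

require 'anemone'
require 'mongo'
require 'json'

urls = []

Anemone.crawl("http://www.mybigexamplesite.com/",
              :storage => Anemone::Storage.MongoDB) do |anemone|
  anemone.on_every_page do |page|
    urls << page.url.to_s   # page.url is a URI, so stringify it
  end
end

# Dump everything as one JSON array once the crawl finishes
File.open('sitemap.json', 'w') { |f| f.write(JSON.pretty_generate(urls)) }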

– mustacheMcGee
  • UPDATE: I came in this morning and there were about 1500 URLs in my output file, far less than the 1.x million I'm going for. There was also an error in the command-line window: `Serialization Failed: failed to allocate memory in bson_buffer.c` – mustacheMcGee Aug 22 '13 at 13:54
  • UPDATE 2: Using JRuby got about twice the number of URLs, 3300 – mustacheMcGee Aug 22 '13 at 18:51
  • UPDATE 3: The Spidr gem is doing much better, 70,000 URLs and counting – mustacheMcGee Aug 23 '13 at 11:48

2 Answers


If anyone out there needs <= 100,000 URLs, the Ruby gem Spidr is a great way to go.
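
A minimal sketch of that kind of loop (this assumes Spidr's `Spidr.site` entry point and its `every_url` callback, as described in the gem's README - treat it as a starting point rather than a tested script):

require 'spidr'
require 'json'

urls = []

# Crawl every page on the host and record each URL as it is found
Spidr.site('http://www.mybigexamplesite.com/') do |spider|
  spider.every_url do |url|
    urls << url.to_s
  end
end

File.write('sitemap.json', JSON.pretty_generate(urls))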

– mustacheMcGee

This is probably not the answer you wanted to see, but I highly advise that you don't use Anemone, and perhaps Ruby for that matter, for crawling a million pages.

Anemone is not a maintained library and fails on many edge cases.

Ruby is not the fastest language, and it uses a global interpreter lock, which means that you can't have true threading capabilities. I think your crawling will probably be too slow. For more information about threading, I suggest you check out the following links.

http://ablogaboutcode.com/2012/02/06/the-ruby-global-interpreter-lock/

Does ruby have real multithreading?
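
As a rough illustration of what the GIL means for CPU-bound work (a toy benchmark, not a crawler - numbers will vary by machine and Ruby implementation):

require 'benchmark'

def work
  500_000.times { |i| Math.sqrt(i) }  # purely CPU-bound busy work
end

sequential = Benchmark.realtime { 4.times { work } }

threaded = Benchmark.realtime do
  4.times.map { Thread.new { work } }.each(&:join)
end

# On MRI the GIL lets only one thread run Ruby code at a time, so the two
# timings come out roughly equal; on JRuby the threaded version can use
# multiple cores and should finish noticeably faster.
puts "sequential: #{'%.2f' % sequential}s, threaded: #{'%.2f' % threaded}s"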

You can try using Anemone with Rubinius or JRuby, which are much faster, but I'm not sure about the extent of compatibility.

I had some mild success going from Anemone to Nutch but your mileage may vary.

– sunnyrjuneja
  • Just spinning my wheels trying to get Nutch going - had to install the JDK and now I'm stuck on `'bin' is not recognized as an internal or external command`. Tried JRuby as you suggested but it timed out as well with `Anemone::Storage::GenericError: Java heap space` (JRuby did seem to get more URLs before it froze, however) – mustacheMcGee Aug 22 '13 at 17:05
  • Hey MustacheMcGee, what triggers the error? What OS are you using? – sunnyrjuneja Aug 22 '13 at 20:12
  • I'm using Windows 7 (64-bit). I've closed that cmd-line window so I can't find the exact wording... I think the rest of the error message was just specific lines in the Anemone.rb file. – mustacheMcGee Aug 23 '13 at 14:04
  • If Spidr is working well for you, add it as an answer and mark this question answered! – sunnyrjuneja Aug 23 '13 at 19:34
  • I would say Spidr is working MUCH better, but it has slowed to a crawl at around 100,000 URLs. So the question is not completely resolved. If anyone out there needs <= 100,000 URLs, Spidr is a great way to go. – mustacheMcGee Aug 26 '13 at 14:09
  • Have you considered breaking it up into multiple jobs? Where are you storing the URLs? A database or in memory? – sunnyrjuneja Aug 26 '13 at 18:59
  • I was using MongoDB with Anemone. Breaking up the job would probably be a good way to go. I have Webmaster Tools access, so I think I will try their API to get the full list of indexed URLs – mustacheMcGee Aug 27 '13 at 19:53
  • How are you storing the URLs with Spidr? – sunnyrjuneja Aug 27 '13 at 21:06
  • I didn't :O I couldn't find anything in the documentation about plugging in MongoDB or similar... – mustacheMcGee Aug 27 '13 at 21:39
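
For anyone hitting the same wall: as far as I can tell Spidr doesn't ship a MongoDB storage backend, but the URLs can be pushed into a collection by hand from the every_url callback. A rough, untested sketch (it uses the 2.x mongo driver API, newer than this thread, and the database/collection names are made up):

require 'spidr'
require 'mongo'

# Connection details are placeholders - adjust to your setup
client = Mongo::Client.new(['127.0.0.1:27017'], database: 'crawl')
urls   = client[:urls]

Spidr.site('http://www.mybigexamplesite.com/') do |spider|
  spider.every_url do |url|
    # Store each URL as its own document instead of holding them all in memory
    urls.insert_one(url: url.to_s)
  end
end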