I just learned about Anemone, the Ruby spider framework. Its site says:

Note: Every storage engine will clear out existing Anemone data before beginning a new crawl.

Question: can I avoid this, i.e. keep what has already been crawled, and refresh/update that copy during the new crawl?

Rationale:

I want to use Anemone as a local store of remote web pages. My existing page parsers can then access Nokogiri DOM document objects from it. Many page parsers will need to access the same URL, so this should avoid fetching the same page twice.
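To make the rationale concrete, this is the workflow I'm after. `Anemone.crawl`, the `:storage` option, `Anemone::Storage.PStore` and `on_every_page` are from Anemone's README, and `Page#doc` is the page body parsed by Nokogiri; the exact key format of the page store is an assumption worth verifying. Step 2 is precisely what the clearing behaviour breaks:

```ruby
require 'anemone'

# 1. Crawl once, persisting pages to disk instead of the default in-memory hash.
Anemone.crawl('http://example.com/',
              :storage => Anemone::Storage.PStore('./pages.pstore')) do |anemone|
  anemone.on_every_page { |page| puts "fetched #{page.url}" }
end

# 2. Later, page parsers re-open the same store and read cached pages.
#    (As quoted above, re-opening currently wipes the store, which is the problem.)
pages = Anemone::Storage.PStore('./pages.pstore')
page  = pages['http://example.com/']   # assuming the store is keyed by URL string
doc   = page.doc                       # Anemone::Page#doc: stored body as a Nokogiri doc
puts doc.css('title').text if doc
```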

Plus, Anemone may be smart enough to use the HTTP Expires header to determine whether a page has changed and hence needs to be re-downloaded (since it already has the previous DOM document).
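I don't know whether Anemone does this out of the box; if not, a check along these lines could sit in front of the store. A minimal sketch, assuming the previous response headers were kept in Net::HTTP's `to_hash` format (lowercase keys, array values); `refetch_if_stale`, `cached_headers` and `cached_body` are hypothetical names standing in for whatever was stored from the last crawl:

```ruby
require 'net/http'
require 'uri'
require 'time'

# Reuse the cached copy while the Expires header says it is fresh;
# otherwise fall back to a conditional GET and let the server decide.
def refetch_if_stale(url, cached_headers, cached_body)
  expires = cached_headers['expires'] && Time.parse(cached_headers['expires'].first)
  return cached_body if expires && Time.now < expires  # still fresh, skip the fetch

  uri = URI(url)
  req = Net::HTTP::Get.new(uri.request_uri)
  lm  = cached_headers['last-modified']
  req['If-Modified-Since'] = lm.first if lm
  res = Net::HTTP.start(uri.host, uri.port) { |http| http.request(req) }
  res.is_a?(Net::HTTPNotModified) ? cached_body : res.body
end
```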

lulalala
  • Seems like a simple monkey patch is called for here. It looks like Anemone supports several storage possibilities (Redis, SQLite, etc.). If you can't figure it out yourself, specify one and someone will probably help you out (a sketch along these lines follows the comments). – pguardiario Nov 23 '12 at 09:04
  • @lulalala maybe `wget` can help – Viren Nov 24 '12 at 06:26
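A sketch of pguardiario's idea, approached from the other direction: instead of monkey-patching a built-in adapter, pass Anemone a custom hash-like store that simply never clears itself. Anemone's built-in adapters duck-type Ruby's Hash, so the method list below is inferred from them, not guaranteed; check your Anemone version's lib/anemone/storage/*.rb for the exact interface it expects. `KeepEverythingStore` is a hypothetical name.

```ruby
require 'anemone'
require 'pstore'

# Hash-like storage backed by Ruby's PStore that keeps existing data.
# Unlike the stock PStore adapter, it never deletes the file on startup.
class KeepEverythingStore
  def initialize(file)
    @db = PStore.new(file)  # note: no File.delete here
  end

  def [](key)
    @db.transaction(true) { @db[key] }
  end

  def []=(key, val)
    @db.transaction { @db[key] = val }
  end

  def has_key?(key)
    @db.transaction(true) { @db.root?(key) }
  end

  def delete(key)
    @db.transaction { @db.delete(key) }
  end

  def size
    @db.transaction(true) { @db.roots.size }
  end

  def keys
    @db.transaction(true) { @db.roots }
  end

  def each
    keys.each { |k| yield k, self[k] }
  end

  def merge!(hash)
    hash.each { |k, v| self[k] = v }
    self
  end

  def close; end
end

Anemone.crawl('http://example.com/',
              :storage => KeepEverythingStore.new('./pages.pstore')) do |anemone|
  # pages from earlier crawls remain in ./pages.pstore
end
```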

0 Answers