I just learned about Anemone, the Ruby spider framework. Its site says:
Note: Every storage engine will clear out existing Anemone data before beginning a new crawl.
Question: Can I avoid this, i.e. keep what has already been crawled and refresh/update the stored copy during a new crawl?
Rationale:
I want to use Anemone as a local store of remote web pages, so that my existing page parsers can access Nokogiri DOM document objects from it. Many of those parsers need the same URL, so this should avoid fetching the same page more than once.
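Something like this is what I have in mind, based on the examples in the Anemone README (a rough sketch; the PStore file name and the CSS selector are just placeholders):

require 'anemone'

# Crawl once, persisting pages to a local PStore file instead of memory.
# ("pages.pstore" is a placeholder path.)
Anemone.crawl("http://www.example.com/") do |anemone|
  anemone.storage = Anemone::Storage.PStore("pages.pstore")

  anemone.on_every_page do |page|
    # page.doc is the parsed Nokogiri document for the fetched page,
    # which my existing parsers would consume.
    titles = page.doc.css("h1").map(&:text) if page.doc
    puts "#{page.url}: #{titles.inspect}"
  end
end

The problem is that, per the note above, the PStore file would be wiped at the start of every new crawl rather than reused.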
Plus, Anemone may be smart enough to use the HTTP Expires header to determine whether a page has been updated and therefore needs to be re-downloaded (since it already has the previous DOM document).
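If Anemone doesn't do this itself, I could presumably check freshness by hand from the stored response headers before deciding to re-crawl a URL. A rough sketch (assuming page.headers holds the original response headers keyed in lowercase; the helper name is my own):

require 'time'

# Hypothetical helper: returns true if a stored page is still fresh
# according to its Expires header, so re-downloading can be skipped.
def still_fresh?(page)
  expires = Array(page.headers['expires']).first
  return false unless expires
  Time.httpdate(expires) > Time.now
rescue ArgumentError
  false # unparseable date: treat the page as stale
end

But that only helps if the previously crawled pages survive into the next crawl, which brings me back to the question above.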