
I'm using RCrawler to crawl ~300 websites. Their sizes vary widely: some are small (a dozen or so pages) and others are large (thousands of pages per domain). Crawling the large ones is very time-consuming, and, for my research purpose, the added value of more pages decreases once I already have a few hundred.

So: is there a way to stop the crawl once a certain number of pages has been collected?

I know I can limit the crawl with MaxDepth, but even at MaxDepth = 2 this is still an issue, and MaxDepth = 1 is not desirable for my research. I'd also prefer to keep MaxDepth high so that the smaller websites do get crawled completely.
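
For reference, roughly what my current call looks like (the URL and directory are placeholders):

```r
library(Rcrawler)

# Crawl one site; even with MaxDepth = 2, large domains still yield thousands of pages.
Rcrawler(Website = "https://example.com",
         DIR = "./crawl_output",
         MaxDepth = 2)
```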

Thanks a lot!

mayayaya

1 Answer


How about implementing a custom function for the FUNPageFilter parameter of the Rcrawler function? The function could check the number of files already saved in DIR and return FALSE once there are too many.
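
A rough, untested sketch of what I mean; the directory, the page limit, and the single `page` argument of the filter are assumptions on my part, so check the RCrawler docs for the exact object FUNPageFilter receives:

```r
library(Rcrawler)

max_pages <- 500                       # assumed cap: stop collecting after this many pages
crawl_dir <- "./crawl_output"          # must match the DIR passed to Rcrawler()

# Filter evaluated for each fetched page: keep collecting only while the
# number of files already written to crawl_dir stays below max_pages.
page_limit_filter <- function(page) {
  n_saved <- length(list.files(crawl_dir, recursive = TRUE))
  n_saved < max_pages                  # TRUE = collect this page, FALSE = skip it
}

Rcrawler(Website = "https://example.com",
         DIR = crawl_dir,
         MaxDepth = 10,
         FUNPageFilter = page_limit_filter)
```

Note that this caps what gets collected; depending on how RCrawler applies FUNPageFilter internally, the crawler may still follow links beyond that point, so treat it as a collection limit rather than a hard stop.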

Dan T.