
Question: if I have a Scrapy CrawlSpider with one callback that does standard link extraction and another that uses XPath to find relative file paths in img src attributes, what is happening generally as these callbacks execute? Is all of the information relevant to both callbacks held in memory, so there is no need for multiple crawls? Or will the site be crawled multiple times, since the callbacks seem to be finding different things (i.e., relative paths that the standard link extractor doesn't return, as well as absolute ones)?

Background: I had a CrawlSpider that does basic link extraction and another that looks for relative file paths to images in img src HTML attributes. For the sake of efficiency, I put the two function definitions under the same spider. However, before pulling the trigger on crawling the target site, I wanted to get a sense of whether this could increase the risk of getting blocked or generally place larger demands on the target site. If, for example, the spider crawled the entire domain twice, that would seem to increase the risk. But if I'm correctly interpreting the main response to this question, the CrawlSpider may hold everything these different functions are scraping in memory, such that having multiple callbacks doesn't increase my footprint on the target domain. If I understood a little better what is going on under the hood, I would feel more comfortable before crawling sites where I think there is a legitimate risk of getting banned.

Thank you!

Tigelle
  • all `Request`s use only absolute URLs, and Scrapy filters duplicated links. – furas Jan 02 '18 at 07:04
  • Scrapy's [Architecture](https://doc.scrapy.org/en/latest/topics/architecture.html) – furas Jan 02 '18 at 07:05
  • you can be banned even if you don't request the same URLs but request URLs at a speed no human could reach - i.e., a human can't open 10 links in 0.1s, or open URLs with exactly the same delay every time. (But happily Scrapy uses random delays for requests.) You can also use proxy servers to have different IPs (and use different "User-Agent" headers), and then it looks like many different humans. – furas Jan 02 '18 at 07:11
  • @furas Thanks for your comments. I appreciate your help. On your first comment, I'm not sure I understand exactly what that means. On the second comment, I had seen the Scrapy architecture page, but I still wasn't clear on what happens with multiple callbacks. On the third comment, I know those things. Specifically, I'm trying to figure out what, if any, additional risks Scrapy might present with different functions under the same spider. – Tigelle Jan 03 '18 at 01:13

0 Answers