Questions: if I have a Scrapy CrawlSpider with one callback that does standard link extraction and another that pulls relative file paths out of img src attributes with XPath, what is happening, generally, as those callbacks are executed? Is all of the information relevant to both callbacks held in memory, so there is no need for multiple crawls? Or will the site be crawled multiple times, since the callbacks seem to be finding different things (i.e., absolute paths as well as relative paths that the standard link extractor doesn't return)?
Background: I had a CrawlSpider that does basic link extraction and another that looks for relative file paths to images in img src attributes. For the sake of efficiency, I just put the two function definitions under the same spider. However, before I pulled the trigger on crawling the target site, I wanted to get a sense of whether this could increase the risk of getting blocked or generally place larger demands on the target site. If, for example, the spider crawled the entire domain twice, that would seem to raise the risk. But if I'm correctly interpreting the main response to this question, the CrawlSpider may hold everything these different functions are scraping in memory, such that having multiple callbacks doesn't increase my footprint on the target domain. If I understood a little better what is going on under the hood, I would feel more comfortable crawling sites where I think there is a legitimate risk of getting banned or otherwise causing problems.
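
For concreteness, here is a rough sketch of what I mean by putting both extraction functions under the same spider; the spider name, domain, selectors, and item fields are placeholders rather than my actual setup:

```python
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class CombinedSpider(CrawlSpider):
    # Placeholder names; the real spider points at a different domain.
    name = "combined"
    allowed_domains = ["example.com"]
    start_urls = ["https://example.com/"]

    rules = (
        # Each response downloaded by following these links is passed
        # to parse_item, which runs both extraction steps on it.
        Rule(LinkExtractor(), callback="parse_item", follow=True),
    )

    def parse_item(self, response):
        yield from self.extract_page_links(response)
        yield from self.extract_image_paths(response)

    def extract_page_links(self, response):
        # Standard link extraction: absolute URLs built from anchor hrefs.
        for href in response.xpath("//a/@href").getall():
            yield {"type": "link", "url": response.urljoin(href)}

    def extract_image_paths(self, response):
        # Relative paths from img src attributes, read from the same
        # already-downloaded response, so no extra request is made.
        for src in response.xpath("//img/@src").getall():
            yield {"type": "image_src", "src": src}
```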
Thank you!