I want to either disable the depth checking and iteration for a method in my spider or change the depth limit while crawling. Here's some of my code:
def start_requests(self):
if isinstance(self.vuln, context.GenericVulnerability):
yield Request(
self.vuln.base_url,
callback=self.determine_aliases,
meta=self._normal_meta,
)
else:
for url in self.vuln.entrypoint_urls:
yield Request(
url, callback=self.parse, meta=self._patch_find_meta
)
@inline_requests
def determine_aliases(self, response):
vulns = [self.vuln]
processed_vulns = set()
while vulns:
vuln = vulns.pop()
if vuln.vuln_id is not self.vuln.vuln_id:
response = yield Request(vuln.base_url)
processed_vulns.add(vuln.vuln_id)
aliases = context.create_vulns(*list(self.parse(response)))
for alias in aliases:
if alias.vuln_id in processed_vulns:
continue
if isinstance(alias, context.GenericVulnerability):
vulns.append(alias)
else:
logger.info("Alias discovered: %s", alias.vuln_id)
self.cves.add(alias)
yield from self._generate_requests_for_vulns()
def _generate_requests_for_vulns(self):
for vuln in self.cves:
for url in vuln.entrypoint_urls:
yield Request(
url, callback=self.parse, meta=self._patch_find_meta
)
My program is such that the user can give the depth limit they need/want as an input. Under some conditions, my default parse method allows recursively crawling links.
determine_aliases is kind of a preprocessing method, and the requests generated from _generate_requests_for_vulns are for the actual solution.
As you can see, I scrape the data I need from the response and store that in a set attribute 'cves' in my spider class from determine_aliases. Once that's done, I yield Requests w/r/t that data from _generate_requests_for_vulns.
The problem here is that either yielding requests from determine_aliases or calling determine_aliases as a callback iterates the depth. So when I yield Requests from _generate_requests_for_vulns for further crawling, my depth limit is reached sooner than expected.
Note that the actual crawling solution starts from the requests generated by _generate_requests_for_vulns, so the given depth limit should be applied only from those requests.