I'm using Scrapy to crawl a set of similar pages (webcomics). Because these pages are very similar, I wrote a class called ComicCrawler which contains all the spider logic and some class variables (start_url, next_selector, etc.). I then override these class variables in concrete classes for each spider.
Manually creating a class for each comic is cumbersome. I now want to specify the attributes in a JSON file and create the classes at runtime (i.e. apply the factory pattern?). How do I best go about that?
Alternatively: is there a way to run a spider without creating a class for it?

Edit: The core problem seems to be that Scrapy uses classes, not instances, for its spiders. Otherwise I'd just make the class variables instance variables and be done with it.
Example:
class ComicSpider(Spider):
    name = None
    start_url = None
    next_selector = None
    # ...
    # this class contains much more logic than shown here

    def start_requests(self):
        # something including / along the lines of...
        yield Request(self.start_url, self.parse)

    def parse(self, response):
        # something including / along the lines of...
        yield Request(response.css(self.next_selector).get(), self.parse)
In another file:
class SupernormalStep(ComicSpider):
    name = "SupernormalStep"
    start_url = "https://supernormalstep.com/archives/8"
    next_selector = "a.cc-next"
what I want:
myComics = {
    "SupernormalStep": {
        "start_url": "https://supernormalstep.com/archives/8",
        "next_selector": "a.cc-next",
    },
    # ...
}

process = CrawlerProcess(get_project_settings())
for name, attributes in myComics.items():
    process.crawl(build_process(name, attributes))
process.start()
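For reference, here is a minimal sketch of what I imagine the hypothetical build_process could do, using the built-in type() to create each subclass at runtime. ComicSpider here is a plain stand-in for the Scrapy-based class above; I haven't tested this against Scrapy itself:

    class ComicSpider:  # stand-in for the Scrapy-based base class above
        name = None
        start_url = None
        next_selector = None

    def build_process(name, attributes):
        # type(name, bases, namespace) creates a new class object at runtime,
        # so the overriding class variables come straight from the JSON dict
        return type(name, (ComicSpider,), {"name": name, **attributes})

    # the same data I would load from the JSON file
    myComics = {
        "SupernormalStep": {
            "start_url": "https://supernormalstep.com/archives/8",
            "next_selector": "a.cc-next",
        },
    }

    spiders = [build_process(name, attrs) for name, attrs in myComics.items()]

Each element of spiders is a class (not an instance), which is what I'd want to hand to process.crawl().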
PS: I crawl responsibly.