I'm working with Scrapy. I have a pipeline that starts with:

class DynamicSQLlitePipeline(object):

    @classmethod
    def from_crawler(cls, crawler):
        # Here, you get whatever value was passed through the "table" parameter
        table = getattr(crawler.spider, "table")
        return cls(table)

    def __init__(self, table):
        try:
            db_path = "sqlite:///" + settings.SETTINGS_PATH + "\\data.db"
            db = dataset.connect(db_path)
            table_name = table[0:3]  # first 3 letters
            self.my_table = db[table_name]
        # ... rest of the pipeline omitted

I've been reading through https://doc.scrapy.org/en/latest/topics/api.html#crawler-api, which says:

The main entry point to Scrapy API is the Crawler object, passed to extensions through the from_crawler class method. This object provides access to all Scrapy core components, and it’s the only way for extensions to access them and hook their functionality into Scrapy.

but I still do not understand the from_crawler method or the crawler object. What is the relationship between the crawler object and the spider and pipeline objects? How and when is a crawler instantiated? Is a spider a subclass of crawler? I've asked Passing scrapy instance (not class) attribute to pipeline, but I don't understand how the pieces fit together.

user1592380
  • see [scrapy's architecture](https://doc.scrapy.org/en/latest/topics/architecture.html) – furas Dec 25 '17 at 23:30
  • Thank you, but crawler is not on that diagram. – user1592380 Dec 25 '17 at 23:33
  • if I understand correctly, `crawler` is the object which runs the engine to do all the things in that diagram: it gets URLs, reads data from the server, uses the `spider` only to parse the data, and uses pipelines and middleware to change the data and do other things like write to a file. – furas Dec 25 '17 at 23:36

1 Answer

Crawler is actually one of the most important objects in Scrapy's architecture. It is a central piece of the crawl execution logic that "glues" a lot of the other pieces together:

The main entry point to Scrapy API is the Crawler object, passed to extensions through the from_crawler class method. This object provides access to all Scrapy core components, and it’s the only way for extensions to access them and hook their functionality into Scrapy.

One or more crawlers are controlled by a CrawlerRunner or a CrawlerProcess instance.

Now, the from_crawler method, which is available on many Scrapy components (spiders, item pipelines, extensions, middlewares), is simply the way for such a component to get access to the crawler instance that is running it.
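For example, an item pipeline can use from_crawler to read an attribute off the running spider and to grab the stats collector. This is only a minimal sketch; the class name, the "table" attribute and the stats key are illustrative, not anything defined by Scrapy:

class TableNamePipeline(object):

    def __init__(self, table_name, stats):
        self.table_name = table_name
        self.stats = stats

    @classmethod
    def from_crawler(cls, crawler):
        # crawler.spider is the running spider instance; crawler.stats,
        # crawler.settings and crawler.signals are also available here.
        return cls(
            table_name=getattr(crawler.spider, "table", "default"),
            stats=crawler.stats,
        )

    def process_item(self, item, spider):
        # count items just to show the stats collector obtained via the crawler
        self.stats.inc_value("items_seen_by_pipeline")
        return item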

Also, take a look at the actual implementations of Crawler, CrawlerRunner and CrawlerProcess.

And what I personally found helpful for better understanding how Scrapy works internally was to run a spider from a script - check out these detailed step-by-step instructions.
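In rough terms it looks like the sketch below (the spider name and the table argument are placeholders); CrawlerProcess creates a Crawler for each crawl() call and drives it for you:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# CrawlerProcess builds a Crawler per crawl() call and runs the Twisted reactor.
process = CrawlerProcess(get_project_settings())
process.crawl("myspider", table="products")  # keyword arguments become spider attributes
process.start()  # blocks here until the crawl is finished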

alecxe