Recently, I've been trying to get to grips with scrapy. I feel that if I had a better understanding of the architecture, I'd move a lot faster. The current, concrete problem I have is this: I want to store all of the links that scrapy extracts in a database, not the responses, just the links. This is for sanity checking.
My initial thought was to use the process_links parameter on a rule and generate items in the function that it points to. However, whereas the callback parameter points to a function that is an item generator, the process_links parameter works more like a filter. In the callback function you yield items and they are automatically collected and put in the pipeline. In the process_links function you return a list of links. You don't generate items.
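To make that concrete, here's a minimal sketch of the setup I'm describing (the spider name, URL and check_links are placeholders I've invented):

    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor

    class LinkAuditSpider(CrawlSpider):
        name = 'link_audit'
        start_urls = ['http://example.com']

        rules = (
            Rule(LinkExtractor(), callback='parse_item',
                 process_links='check_links'),
        )

        def check_links(self, links):
            # process_links behaves like a filter: it receives a list of
            # Link objects and must return a list of Link objects.
            # There's no obvious way to yield an item from here.
            return links

        def parse_item(self, response):
            # The callback, by contrast, is a generator: anything yielded
            # here is collected and sent through the item pipeline.
            yield {'url': response.url}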
I could just make a database connection in the process_links function and write directly to the database, but that doesn't feel like the right way to go when scrapy has asynchronous database transaction support via Twisted.
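What I understand that support to mean is Twisted's adbapi used from an item pipeline, along these lines (a sketch only; it assumes a links table already exists and that the pipeline is enabled in ITEM_PIPELINES):

    from twisted.enterprise import adbapi

    class LinkPipeline(object):
        def open_spider(self, spider):
            # adbapi runs a blocking DB-API module in a thread pool, so
            # inserts don't block the Twisted reactor.
            self.dbpool = adbapi.ConnectionPool(
                'sqlite3', 'links.db', check_same_thread=False)

        def process_item(self, item, spider):
            # Fire off the insert asynchronously (ignoring the returned
            # Deferred for brevity).
            self.dbpool.runOperation(
                'INSERT INTO links (url) VALUES (?)', (item['url'],))
            return item

        def close_spider(self, spider):
            self.dbpool.close()

But that only works if the links arrive in the pipeline as items, which brings me back to the original problem.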
I could try to pass items from the process_links function to the callback function, but I'm not sure about the relationship between the two functions. One is used to generate items, and one receives a list and has to return a list.
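The only plumbing I can picture is stashing the links on the spider itself, something like this, and that doesn't feel right either (pending_links is a list I'd have to invent and initialise in __init__):

    def check_links(self, links):
        # Stash the links so a callback can pick them up later. This seems
        # fragile: nothing ties a stashed link to the particular response
        # the callback eventually receives.
        self.pending_links.extend(links)
        return links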
In trying to think this through, I keep coming up against the fact that I don't understand the control loop within scrapy. What is the process that is reading the items yielded by the callback function? What's the process that supplies the links to, and receives the links from, the process_links function? The one that takes requests and returns responses?
From my point of view, I write code in a spider which generates items. The items are automatically read and moved through a pipeline. I can create code in the pipeline and the items will be automatically passed into and taken out of that code. What's missing is my understanding of exactly how these items get moved through the pipeline.
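That much I can do mechanically. A pipeline is just a class like this, and something inside scrapy calls process_item for every item the spider yields; that "something" is the part I can't picture:

    class SanityCheckPipeline(object):
        def process_item(self, item, spider):
            # Called once per item. Returning the item passes it on to the
            # next pipeline stage. But who calls this, and from where?
            return item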
Looking through the code I can see that the base code for a spider is hiding away in a corner, as all good spiders should, and going under the name of __init__.py. It contains the start_requests() and make_requests_from_url() functions which, according to the docs, are the starting points. But it's not a controlling loop. It's being called by something else.
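As far as I can tell from the docs, overriding start_requests() just yields requests; something else must be consuming the generator and scheduling them:

    import scrapy

    class MinimalSpider(scrapy.Spider):
        name = 'minimal'
        start_urls = ['http://example.com']

        def start_requests(self):
            # Yields Request objects, one per start URL. Nothing in here
            # loops over responses; that happens somewhere else.
            for url in self.start_urls:
                yield scrapy.Request(url, callback=self.parse)

        def parse(self, response):
            pass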
Going from the opposite direction, I can see that when I execute the command scrapy crawl... I'm calling crawl.py, which in turn calls self.crawler_process.start() in crawler.py. That starts a Twisted reactor. There is also core/engine.py, which is another collection of functions that look as though they are designed to control the operation of the spiders.
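Running a spider from a script seems to confirm that this is the entry point (a sketch using the MinimalSpider above, with default settings):

    from scrapy.crawler import CrawlerProcess

    process = CrawlerProcess()
    process.crawl(MinimalSpider)
    process.start()  # starts the Twisted reactor; blocks until the crawl ends

Everything interesting seems to happen behind that process.start() call.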
Despite looking through the code, I don't have a clear mental image of the entire process. I realise that the idea of a framework is that it hides much of the complexity, but I feel that with a better understanding of what is going on, I could make better use of the framework.
Sorry for the long post. If anyone can give me an answer to my specific problem regarding saving links to the database, that would be great. If you were able to give a brief overview of the architecture, that would be extremely helpful.