Recently, I've been trying to get to grips with scrapy. I feel that if I had a better understanding of the architecture, I'd move a lot faster. The current, concrete problem I have is this: I want to store all of the links that scrapy extracts in a database, not the responses, just the links. This is for sanity checking.
My initial thought was to use the process_links parameter on a rule and generate items in the function that it points to. However, whereas the callback parameter points to a function that is an item generator, the process_links parameter works more like a filter. In the callback function you yield items and they are automatically collected and put in the pipeline. In the process_links function you return a list of links. You don't generate items.
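To make that concrete, here's a minimal sketch of the setup I'm describing (the spider name, URL and check_links are placeholders I've invented):

    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor

    class LinkAuditSpider(CrawlSpider):
        name = 'link_audit'
        start_urls = ['http://example.com']

        rules = (
            Rule(LinkExtractor(), callback='parse_item',
                 process_links='check_links'),
        )

        def check_links(self, links):
            # process_links behaves like a filter: it receives a list of
            # Link objects and must return a list of Link objects.
            # There's no obvious way to yield an item from here.
            return links

        def parse_item(self, response):
            # The callback, by contrast, is a generator: anything yielded
            # here is collected and sent through the item pipeline.
            yield {'url': response.url}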
I could just make a database connection in the process_links function and write directly to the database, but that doesn't feel like the right way to go when scrapy has asynchronous database transaction support via Twisted.
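What I understand that support to mean is Twisted's adbapi used from an item pipeline, along these lines (a sketch only; it assumes a links table already exists and that the pipeline is enabled in ITEM_PIPELINES):

    from twisted.enterprise import adbapi

    class LinkPipeline(object):
        def open_spider(self, spider):
            # adbapi runs a blocking DB-API module in a thread pool, so
            # inserts don't block the Twisted reactor.
            self.dbpool = adbapi.ConnectionPool(
                'sqlite3', 'links.db', check_same_thread=False)

        def process_item(self, item, spider):
            # Fire off the insert asynchronously (ignoring the returned
            # Deferred for brevity).
            self.dbpool.runOperation(
                'INSERT INTO links (url) VALUES (?)', (item['url'],))
            return item

        def close_spider(self, spider):
            self.dbpool.close()

But that only works if the links arrive in the pipeline as items, which brings me back to the original problem.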
I could try to pass items from the process_links function to the callback function, but I'm not sure about the relationship between the two functions. One is used to generate items, and one receives a list and has to return a list.
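The only plumbing I can picture is stashing the links on the spider itself, something like this, and that doesn't feel right either (pending_links is a list I'd have to invent and initialise in __init__):

    def check_links(self, links):
        # Stash the links so a callback can pick them up later. This seems
        # fragile: nothing ties a stashed link to the particular response
        # the callback eventually receives.
        self.pending_links.extend(links)
        return links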
In trying to think this through, I keep coming up against the fact that I don't understand the control loop within scrapy. What is the process that is reading the items yielded by the callback function? What's the process that supplies the links to, and receives the links from, the process_links function? The one that takes requests and returns responses?
From my point of view, I write code in a spider which generates items. The items are automatically read and moved through a pipeline. I can create code in the pipeline and the items will be automatically passed into and taken out of that code. What's missing is my understanding of exactly how these items get moved through the pipeline.
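That much I can do mechanically. A pipeline is just a class like this, and something inside scrapy calls process_item for every item the spider yields; that "something" is the part I can't picture:

    class SanityCheckPipeline(object):
        def process_item(self, item, spider):
            # Called once per item. Returning the item passes it on to the
            # next pipeline stage. But who calls this, and from where?
            return item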
Looking through the code I can see that the base code for a spider is hiding away in a corner, as all good spiders should, and going under the name of __init__.py. It contains the start_requests() and make_requests_from_url() functions which, according to the docs, are the starting points. But it's not a controlling loop. It's being called by something else.
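As far as I can tell from the docs, overriding start_requests() just yields requests; something else must be consuming the generator and scheduling them:

    import scrapy

    class MinimalSpider(scrapy.Spider):
        name = 'minimal'
        start_urls = ['http://example.com']

        def start_requests(self):
            # Yields Request objects, one per start URL. Nothing in here
            # loops over responses; that happens somewhere else.
            for url in self.start_urls:
                yield scrapy.Request(url, callback=self.parse)

        def parse(self, response):
            pass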
Going from the opposite direction, I can see that when I execute the command scrapy crawl... I'm calling crawl.py, which in turn calls self.crawler_process.start() in crawler.py. That starts a Twisted reactor. There is also core/engine.py, which is another collection of functions that look as though they are designed to control the operation of the spiders.
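Running a spider from a script seems to confirm that this is the entry point (a sketch using the MinimalSpider above, with default settings):

    from scrapy.crawler import CrawlerProcess

    process = CrawlerProcess()
    process.crawl(MinimalSpider)
    process.start()  # starts the Twisted reactor; blocks until the crawl ends

Everything interesting seems to happen behind that process.start() call.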
Despite looking through the code, I don't have a clear mental image of the entire process. I realise that the idea of a framework is that it hides much of the complexity, but I feel that with a better understanding of what is going on, I could make better use of the framework.
Sorry for the long post. If anyone can give me an answer to my specific problem regarding saving links to the database, that would be great. If you were able to give a brief overview of the architecture, that would be extremely helpful.