
Does it matter that my Scrapy script writes to my MySQL db in the body of the spider instead of through pipelines.py? Does this slow down the spider? Note that I don't have any items listed in items.py.

Follow up: how and when is pipelines.py invoked? What happens after the yield statement?
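
For reference, here is a simplified sketch of what I mean; the table, selectors and connection details are just placeholders:

```python
import pymysql
import scrapy


class MySpider(scrapy.Spider):
    name = "myspider"
    start_urls = ["http://quotes.toscrape.com/"]

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Placeholder connection details
        self.conn = pymysql.connect(host="localhost", user="root",
                                    password="secret", db="scraping")

    def parse(self, response):
        for quote in response.css("div.quote"):
            text = quote.css("span.text::text").extract_first()
            # Blocking INSERT done right here in the callback,
            # no Item class and no pipeline involved
            with self.conn.cursor() as cur:
                cur.execute("INSERT INTO quotes (text) VALUES (%s)", (text,))
            self.conn.commit()
            yield {"text": text}
```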

user6055239

2 Answers


It depends heavily on the implementation, but if you implement the database writes in a way that doesn't block too much, there isn't much of a difference performance-wise.

There is, however, a pretty big structural difference. Scrapy's design philosophy strongly encourages using middlewares and pipelines for the sake of keeping spiders clean and understandable.

In other words: spiders should crawl data, middlewares should modify requests and responses, and pipelines should pipe the returned data through some external logic (like putting it into a database or a file).
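
For illustration, here is a minimal pipeline sketch using pymysql and a hypothetical `quotes` table (adapt the SQL and connection settings to your own schema); the spider then only yields plain dicts and never touches the database:

```python
# pipelines.py
import pymysql


class MySQLStorePipeline(object):
    def open_spider(self, spider):
        # Called once when the spider opens; placeholder connection details
        self.conn = pymysql.connect(host="localhost", user="root",
                                    password="secret", db="scraping")

    def close_spider(self, spider):
        self.conn.close()

    def process_item(self, item, spider):
        # Called once for every item the spider yields
        with self.conn.cursor() as cur:
            cur.execute("INSERT INTO quotes (text) VALUES (%s)",
                        (item["text"],))
        self.conn.commit()
        return item
```

Note that you don't need anything in items.py for this to work: yielding plain dicts from the spider is enough.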

Regarding your follow up question:

how and when is pipelines.py invoked? What happens after the yield statement?

Take a look at the Architecture Overview documentation page, and if you'd like to dig deeper you'll need to understand the Twisted asynchronous framework, since Scrapy is essentially a big, smart framework built around it.
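
The short version of the flow: every item your callback yields goes to the engine, which calls process_item() on each pipeline class enabled in ITEM_PIPELINES, in ascending order of the priority number. A sketch of the settings entry (the class path refers to the hypothetical pipeline above):

```python
# settings.py
ITEM_PIPELINES = {
    # Lower numbers run first; every yielded item is passed through
    # process_item() of each enabled pipeline in this order
    'myproject.pipelines.MySQLStorePipeline': 300,
}
```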

Granitosaurus
  • "pipelines should pipe returned data through some external logic" - pedantic but, I would say that it's better to use an extension and more specifically a [Feed Export](https://doc.scrapy.org/en/latest/topics/feed-exports.html). [This](https://github.com/scrapy/scrapy/blob/d8672689761f0bb6c0550a841f35534265e87fee/scrapy/extensions/feedexport.py) is where scrapy has its default Feed Exports. Pipelines are more for domain specific business logic that might enrich or drop an `Item`. – neverlastn Apr 11 '17 at 13:52
  • @neverlastn while you are correct that feed exporters should be used when available, it doesn't mean you can't have async exports via pipelines - after all, the whole Scrapy engine runs on the Twisted reactor, which is accessible at all times. Built-in feed exporters are also very hard to extend. More to the point, by "external logic" I mean anything that is not page parsing, not necessarily calling some external script or program. – Granitosaurus Apr 11 '17 at 14:18
  • I might have become a bit cynical, but for anything but very small crawls the easiest, good-enough approach is to dump to a local file and, at the end of the crawl, use another technique to batch import (e.g. within a single SQL transaction, locking just once). Otherwise you end up with async APIs that most people won't get right, weird performance issues, and a per-item vs. per-batch/job import model, which means bad insert performance and likely having to deduplicate or fix corrupt data, e.g. if your job crashes and you have to restart. – neverlastn Apr 11 '17 at 20:38
  • @neverlastn I'm with you on that, KISS :) Unfortunately it's not always possible in complex production crawls. – Granitosaurus Apr 12 '17 at 05:53

If you want the best performance, store items in a file (e.g. CSV) and bulk insert them into your database once the crawl completes. For CSV data you can use mysqlimport (see MySQL bulk insert from CSV data files). The recommended approach is to not block while inserting, which requires a pipeline that uses the Twisted RDBMS API.
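
As a rough sketch of the non-blocking variant, using twisted.enterprise.adbapi with the MySQLdb driver and a hypothetical `quotes` table (connection details and SQL are placeholders):

```python
# pipelines.py
from twisted.enterprise import adbapi


class AsyncMySQLPipeline(object):
    def open_spider(self, spider):
        # adbapi runs each query in a thread pool, so the Twisted
        # reactor (and therefore the crawl) is never blocked
        self.dbpool = adbapi.ConnectionPool(
            "MySQLdb", host="localhost", user="root",
            passwd="secret", db="scraping", charset="utf8mb4")

    def close_spider(self, spider):
        self.dbpool.close()

    def process_item(self, item, spider):
        # Returning the Deferred lets Scrapy keep crawling while the
        # insert completes; the item is passed on once it does
        d = self.dbpool.runOperation(
            "INSERT INTO quotes (text) VALUES (%s)", (item["text"],))
        d.addCallback(lambda _: item)
        return d
```

Whether this is worth it over the dump-to-a-file approach depends on the size of the crawl, as discussed in the comments above.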

neverlastn