Scrapy: fastest way to bulk upload to database?

Question

My Scrapy spider uploads results to a Mongo database every 1000 scraped URLs, appending each result to a list before uploading. Given that appending to a list is somewhat slow, is there a way to build the batch with a list comprehension instead? Is collecting results in a list the fastest approach?

Here's my (simplified) spider:

import asyncio

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "spy"

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # init MongoDB instance
        self.res_list = []
        self.urls = self.x.urls(10000)

    def start_requests(self):
        for url in self.urls:
            yield scrapy.Request(url=url, callback=self.parse)

    async def do_insert(self, documents):
        await self.db['coll'].insert_many(documents)

    def parse(self, r):
        res = self.x.process(r)
        self.res_list.append(res)
        # Flush once the buffer holds 1000 results
        if len(self.res_list) >= 1000:
            print('UPLOADING TO DATABASE...')
            loop = asyncio.get_event_loop()
            # Blocks until the bulk insert has completed
            loop.run_until_complete(self.do_insert(self.res_list))
            self.res_list = []
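
One thing the simplified spider above does not show is what happens to the final partial batch: any results still in the list when the crawl ends are never uploaded. A minimal sketch of a batch buffer that handles this is below. `BatchBuffer`, `flush_fn`, and the example URLs are hypothetical names introduced for illustration; in a real spider `flush_fn` would wrap the `insert_many` call, and the final `flush()` could be called from the spider's `closed` method or an item pipeline's `close_spider`.

```python
from typing import Any, Callable, Dict, List

Document = Dict[str, Any]


class BatchBuffer:
    """Collect documents and hand them to `flush_fn` in fixed-size batches."""

    def __init__(self, flush_fn: Callable[[List[Document]], None],
                 batch_size: int = 1000) -> None:
        if batch_size < 1:
            raise ValueError("batch_size must be at least 1")
        self.flush_fn = flush_fn      # hypothetical hook, e.g. a bulk insert
        self.batch_size = batch_size
        self._docs: List[Document] = []

    def add(self, doc: Document) -> None:
        """Buffer one document and flush when the batch is full."""
        self._docs.append(doc)
        if len(self._docs) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        """Send whatever is buffered (possibly a partial batch)."""
        if not self._docs:
            return
        # Swap in a fresh list before calling flush_fn so the batch that
        # was handed over is never mutated afterwards.
        batch, self._docs = self._docs, []
        self.flush_fn(batch)


# Stand-in for the database: record each batch instead of inserting it.
uploaded: List[List[Document]] = []
buffer = BatchBuffer(flush_fn=uploaded.append, batch_size=1000)

for i in range(2500):
    buffer.add({"url": f"https://example.com/{i}"})
buffer.flush()  # upload the final partial batch

batch_sizes = [len(batch) for batch in uploaded]
print(batch_sizes)  # [1000, 1000, 500]
```

The buffer still uses `list.append`; since `parse` receives one response at a time, there is no iterable to run a comprehension over, so the batching logic is where the design choices are.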

Do you have a performance issue, or you want to check if there is some faster option? — Tarun Lalwani, Apr 03 '21 at 14:47
Does this answer your question? [Why is a list comprehension so much faster than appending to a list?](https://stackoverflow.com/questions/30245397/why-is-a-list-comprehension-so-much-faster-than-appending-to-a-list) — Tom Slabbaert, Apr 03 '21 at 14:51
@TarunLalwani yes, technically it's a performance issue. The program currently scrapes at ~550 pages/min with a theoretical max of ~1100 pages/min. Since I have hundreds of millions of links, increasing throughput is needed (in addition to deploying parallel spiders) — mmz, Apr 03 '21 at 15:01
@TomSlabbaert apologies for not clarifying. I'm indeed familiar with list comprehension being faster. My question is geared towards how I could use list comprehension in the above code example, since the parsing is done async. For example, it wouldn't be possible to do `test_list = [self.x.process(x) for x in r]` since the parsing function is fed a response object — mmz, Apr 03 '21 at 15:05
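
To put the append cost in context with the ~550 pages/min figure above (about 0.11 s per page), a rough micro-benchmark can be run with the standard library's `timeit`. This is only a sketch: `N`, the function names, and the repeat counts are arbitrary choices, and the measured numbers will vary by machine. The per-item append cost is expected to be far smaller than the per-page time budget, which would suggest the list is unlikely to be the bottleneck.

```python
import timeit

N = 100_000        # items per run (arbitrary choice)
RUNS = 10          # calls per timing sample
PAGES_PER_MIN = 550  # throughput quoted in the comment above


def with_append():
    """Build a list one element at a time, as the spider does."""
    out = []
    for i in range(N):
        out.append(i)
    return out


def with_comprehension():
    """Build the same list with a list comprehension."""
    return [i for i in range(N)]


# Take the best of three samples and convert to seconds per item.
append_s = min(timeit.repeat(with_append, number=RUNS, repeat=3)) / (RUNS * N)
comp_s = min(timeit.repeat(with_comprehension, number=RUNS, repeat=3)) / (RUNS * N)
per_page_s = 60 / PAGES_PER_MIN

print(f"append:        {append_s * 1e9:.1f} ns per item")
print(f"comprehension: {comp_s * 1e9:.1f} ns per item")
print(f"page budget:   {per_page_s * 1e3:.1f} ms per page")
```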
