How to make a Scrapy spider download images from start_urls?

Question

start_urls = ['https://image.jpg']

def start_requests(self):
    for url in self.start_urls:
        request = scrapy.Request(url,callback=self.parse)
        yield request

def parse(self, response):
    item = GetImgsItem()
    # print(response.url)
    item['image_urls'] = response.url
    yield item

My spider can now download the image from start_urls, but the request is sent twice for a single image. How can I change it so that the image is downloaded directly from start_requests?

Question 2: I created two spiders (spider A and spider B) in my project. Spider A has a dedicated pipeline class for processing its downloaded items, and it works well.

However, when I later ran spider B, it also used spider A's pipeline class. How can I configure the pipeline class so that only spider A uses it?

Does this answer your question? [Returning Items in scrapy's start\_requests()](https://stackoverflow.com/questions/35300052/returning-items-in-scrapys-start-requests) — Gallaecio, Apr 27 '20 at 19:56

score 1 · Answer 1 · answered Apr 26 '20 at 21:19

To answer your second question, take a look at this post:

How can I use different pipelines for different spiders in a single Scrapy project

You can also remove the ITEM_PIPELINES entry from your settings.py file and define custom_settings in your spider instead:

import scrapy

class SpiderA(scrapy.Spider):
    name = 'spider_a'
    # Applies to this spider only and takes precedence over settings.py
    custom_settings = {
        'ITEM_PIPELINES': {
            'project.pipelines.MyPipeline': 300
        }
    }

But I think the example shown in the post above is a bit more elegant.
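Another common way to get the same effect, which is not necessarily the approach from the linked post, is to keep the pipeline enabled for the whole project and let it skip items from spiders it is not meant for. The following is only a minimal sketch: the class name `SpiderAOnlyPipeline`, the `allowed_spiders` attribute and the `processed_by` field are hypothetical names chosen for illustration, and items are assumed to be plain dicts.

```python
class SpiderAOnlyPipeline:
    """Item pipeline that only acts on items from selected spiders."""

    # Hypothetical attribute: names of the spiders this pipeline handles.
    allowed_spiders = {"spider_a"}

    def process_item(self, item, spider):
        if spider.name not in self.allowed_spiders:
            # Any other spider: hand the item on unchanged.
            return item
        # Spider A-specific processing would go here (placeholder step).
        item["processed_by"] = type(self).__name__
        return item
```

With this variant the pipeline stays listed in ITEM_PIPELINES in settings.py, and spider B's items simply pass through it untouched.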

score 0 · Answer 2 · answered Jan 20 '22 at 05:08

For the first question, you could start with a dummy request and then yield the image items from your parse method, so the spider itself never requests the image URLs. This avoids having to hack around other middlewares.

start_urls = ['https://any.dummy.website']
image_urls = [...]

def parse(self, dummy_response):
    yield Item(image_urls=self.image_urls)
