
Is it possible to do something like the code below, but with multiple URLs? Each link has about 50 pages to crawl and loop through. The current solution works, but only when I use one URL instead of multiple URLs.

start_urls = [
    'https://www.xxxxxxx.com.au/home-garden/page-%s/c18397' % page for page in range(1, 50),
    'https://www.xxxxxxx.com.au/automotive/page-%s/c21159' % page for page in range(1, 50),
    'https://www.xxxxxxx.com.au/garden/page-%s/c25449' % page for page in range(1, 50),
]
mrWiga

2 Answers


We can do this by building the full URL list into another list first. I've shared the code for it below. Hope this is what you're looking for.

final_urls = []
start_urls = [
    'https://www.xxxxxxx.com.au/home-garden/page-%s/c18397',
    'https://www.xxxxxxx.com.au/automotive/page-%s/c21159',
    'https://www.xxxxxxx.com.au/garden/page-%s/c25449',
]
# Substitute each page number (1-49) into every template URL.
final_urls.extend(url % page for page in range(1, 50) for url in start_urls)
Output snippet (the inner loop runs over start_urls, so the results interleave by page; note that the [1:20] slice skips the very first element, home-garden page-1):

final_urls[1:20]


 ['https://www.xxxxxxx.com.au/automotive/page-1/c21159',
 'https://www.xxxxxxx.com.au/garden/page-1/c25449',
 'https://www.xxxxxxx.com.au/home-garden/page-2/c18397',
 'https://www.xxxxxxx.com.au/automotive/page-2/c21159',
 'https://www.xxxxxxx.com.au/garden/page-2/c25449',
 'https://www.xxxxxxx.com.au/home-garden/page-3/c18397',
 'https://www.xxxxxxx.com.au/automotive/page-3/c21159',
 'https://www.xxxxxxx.com.au/garden/page-3/c25449',
 'https://www.xxxxxxx.com.au/home-garden/page-4/c18397',
 'https://www.xxxxxxx.com.au/automotive/page-4/c21159',
 'https://www.xxxxxxx.com.au/garden/page-4/c25449',
 'https://www.xxxxxxx.com.au/home-garden/page-5/c18397',
 'https://www.xxxxxxx.com.au/automotive/page-5/c21159',
 'https://www.xxxxxxx.com.au/garden/page-5/c25449',
 'https://www.xxxxxxx.com.au/home-garden/page-6/c18397',
 'https://www.xxxxxxx.com.au/automotive/page-6/c21159',
 'https://www.xxxxxxx.com.au/garden/page-6/c25449',
 'https://www.xxxxxxx.com.au/home-garden/page-7/c18397',
 'https://www.xxxxxxx.com.au/automotive/page-7/c21159']

About your latest enquiry, have you tried this?

def parse(self, response):
    # Schedule a request for every generated URL. Without an explicit
    # callback, each response is routed back to this same parse method.
    for link in final_urls:
        yield scrapy.Request(link)
Hari Krishnan
  • Thank you so much! But how do I start parsing it? final_urls.extend(url % page for page in range(1, 50) for url in start_urls) – mrWiga Aug 16 '18 at 04:46
  • Loop through final_urls, build a Request for each, and process each URL. – Hari Krishnan Aug 16 '18 at 04:50
  • This is what I have; how do I trigger it to run this: def parse(self, response): sel = Selector(response) for link in sel.xpath("//*[contains(@href, '/s-ad/')]"): ad_link = link.css('a::attr(href)').extract_first() absolute_url = self.base_url + ad_link yield response.follow(absolute_url, self.parse_each_ad) – mrWiga Aug 16 '18 at 04:51
  • Please check this answer: https://stackoverflow.com/questions/6566322/scrapy-crawl-urls-in-order. Also, I think this is what you are looking for: https://stackoverflow.com/questions/32435776/is-it-possible-to-crawl-multiple-start-urls-list-simultaneously – Hari Krishnan Aug 16 '18 at 04:55
  • Sorry, but I don't see any relation to this? – mrWiga Aug 16 '18 at 05:12
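
To tie the comment thread together: below is a minimal, self-contained sketch of how the generated URLs and the ad-following logic could fit into one spider. The spider name AdSpider and the parse_each_ad callback are hypothetical, modelled on the snippets above, and it assumes a Scrapy version (1.4+) where response.follow resolves relative links:

import scrapy

# Page templates from the answer above; %s takes the page number.
TEMPLATES = [
    'https://www.xxxxxxx.com.au/home-garden/page-%s/c18397',
    'https://www.xxxxxxx.com.au/automotive/page-%s/c21159',
    'https://www.xxxxxxx.com.au/garden/page-%s/c25449',
]

class AdSpider(scrapy.Spider):  # hypothetical spider name
    name = 'ads'
    # Pages 1-49 for every category, equivalent to final_urls above.
    start_urls = [t % page for page in range(1, 50) for t in TEMPLATES]

    def parse(self, response):
        # Called once per listing page; follow every ad link found on it.
        for href in response.xpath("//*[contains(@href, '/s-ad/')]/@href").extract():
            yield response.follow(href, callback=self.parse_each_ad)

    def parse_each_ad(self, response):
        # Per-ad extraction logic goes here.
        pass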

I recommend using start_requests for this:

def start_requests(self):
    base_urls = [
        'https://www.xxxxxxx.com.au/home-garden/page-{page_number}/c18397',
        'https://www.xxxxxxx.com.au/automotive/page-{page_number}/c21159',
        'https://www.xxxxxxx.com.au/garden/page-{page_number}/c25449',
    ]

    for page in range(1, 50):
        for base_url in base_urls:
            url = base_url.format(page_number=page)
            yield scrapy.Request(url, callback=self.parse)
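
One advantage of this design: with start_requests defined, the spider doesn't need a start_urls attribute at all. Scrapy calls start_requests once and consumes the generator lazily as the scheduler has capacity, so the URLs never have to sit in one big list, and each yielded Request can carry its own callback, headers, or meta when that becomes useful.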
gangabass