
Is it possible to do something like the code below, but with multiple URLs? Each link has about 50 pages to crawl and loop through. The current solution works, but only when I use one URL instead of multiple URLs.

start_urls = [
    'https://www.xxxxxxx.com.au/home-garden/page-%s/c18397' % page for page in range(1, 50),
    'https://www.xxxxxxx.com.au/automotive/page-%s/c21159' % page for page in range(1, 50),
    'https://www.xxxxxxx.com.au/garden/page-%s/c25449' % page for page in range(1, 50),
]
mrWiga

2 Answers


We can do this by building the full URL list into another list first. I've shared the code for it below. Hope this is what you're looking for.

final_urls = []
start_urls = [
    'https://www.xxxxxxx.com.au/home-garden/page-%s/c18397',
    'https://www.xxxxxxx.com.au/automotive/page-%s/c21159',
    'https://www.xxxxxxx.com.au/garden/page-%s/c25449',
]
# Substitute each page number (1-49) into every template URL.
final_urls.extend(url % page for page in range(1, 50) for url in start_urls)
Output snippet (the inner loop runs over start_urls, so the results interleave by page; note that the [1:20] slice skips the very first element, home-garden page-1):

final_urls[1:20]


 ['https://www.xxxxxxx.com.au/automotive/page-1/c21159',
 'https://www.xxxxxxx.com.au/garden/page-1/c25449',
 'https://www.xxxxxxx.com.au/home-garden/page-2/c18397',
 'https://www.xxxxxxx.com.au/automotive/page-2/c21159',
 'https://www.xxxxxxx.com.au/garden/page-2/c25449',
 'https://www.xxxxxxx.com.au/home-garden/page-3/c18397',
 'https://www.xxxxxxx.com.au/automotive/page-3/c21159',
 'https://www.xxxxxxx.com.au/garden/page-3/c25449',
 'https://www.xxxxxxx.com.au/home-garden/page-4/c18397',
 'https://www.xxxxxxx.com.au/automotive/page-4/c21159',
 'https://www.xxxxxxx.com.au/garden/page-4/c25449',
 'https://www.xxxxxxx.com.au/home-garden/page-5/c18397',
 'https://www.xxxxxxx.com.au/automotive/page-5/c21159',
 'https://www.xxxxxxx.com.au/garden/page-5/c25449',
 'https://www.xxxxxxx.com.au/home-garden/page-6/c18397',
 'https://www.xxxxxxx.com.au/automotive/page-6/c21159',
 'https://www.xxxxxxx.com.au/garden/page-6/c25449',
 'https://www.xxxxxxx.com.au/home-garden/page-7/c18397',
 'https://www.xxxxxxx.com.au/automotive/page-7/c21159']

About your latest enquiry, have you tried this?

def parse(self, response):
    # Schedule a request for every generated URL. Without an explicit
    # callback, each response is routed back to this same parse method.
    for link in final_urls:
        yield scrapy.Request(link)
Hari Krishnan
  • Thank you so much! But how do I start parsing it? final_urls.extend(url % page for page in range(1, 50) for url in start_urls) – mrWiga Aug 16 '18 at 04:46
  • Loop through final_urls, build a Request for each, and process each URL. – Hari Krishnan Aug 16 '18 at 04:50
  • This is what I have; how do I trigger it to run this: def parse(self, response): sel = Selector(response) for link in sel.xpath("//*[contains(@href, '/s-ad/')]"): ad_link = link.css('a::attr(href)').extract_first() absolute_url = self.base_url + ad_link yield response.follow(absolute_url, self.parse_each_ad) – mrWiga Aug 16 '18 at 04:51
  • Please check this answer: https://stackoverflow.com/questions/6566322/scrapy-crawl-urls-in-order. Also, I think this is what you are looking for: https://stackoverflow.com/questions/32435776/is-it-possible-to-crawl-multiple-start-urls-list-simultaneously – Hari Krishnan Aug 16 '18 at 04:55
  • Sorry, but I don't see any relation to this? – mrWiga Aug 16 '18 at 05:12
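
To tie the comment thread together: below is a minimal, self-contained sketch of how the generated URLs and the ad-following logic could fit into one spider. The spider name AdSpider and the parse_each_ad callback are hypothetical, modelled on the snippets above, and it assumes a Scrapy version (1.4+) where response.follow resolves relative links:

import scrapy

# Page templates from the answer above; %s takes the page number.
TEMPLATES = [
    'https://www.xxxxxxx.com.au/home-garden/page-%s/c18397',
    'https://www.xxxxxxx.com.au/automotive/page-%s/c21159',
    'https://www.xxxxxxx.com.au/garden/page-%s/c25449',
]

class AdSpider(scrapy.Spider):  # hypothetical spider name
    name = 'ads'
    # Pages 1-49 for every category, equivalent to final_urls above.
    start_urls = [t % page for page in range(1, 50) for t in TEMPLATES]

    def parse(self, response):
        # Called once per listing page; follow every ad link found on it.
        for href in response.xpath("//*[contains(@href, '/s-ad/')]/@href").extract():
            yield response.follow(href, callback=self.parse_each_ad)

    def parse_each_ad(self, response):
        # Per-ad extraction logic goes here.
        pass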

I recommend using start_requests for this:

def start_requests(self):
    base_urls = [
        'https://www.xxxxxxx.com.au/home-garden/page-{page_number}/c18397',
        'https://www.xxxxxxx.com.au/automotive/page-{page_number}/c21159',
        'https://www.xxxxxxx.com.au/garden/page-{page_number}/c25449',
    ]

    for page in range(1, 50):
        for base_url in base_urls:
            url = base_url.format(page_number=page)
            yield scrapy.Request(url, callback=self.parse)
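
One advantage of this design: with start_requests defined, the spider doesn't need a start_urls attribute at all. Scrapy calls start_requests once and consumes the generator lazily as the scheduler has capacity, so the URLs never have to sit in one big list, and each yielded Request can carry its own callback, headers, or meta when that becomes useful.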
gangabass