I am crawling a site that may have a lot of start URLs, like:

http://www.a.com/list_1_2_3.htm

I want to populate start_urls with URLs matching [list_\d+_\d+_\d+\.htm], and extract items from URLs matching [node_\d+\.htm] during crawling.

Can I use CrawlSpider to achieve this? And how can I generate the start_urls dynamically while crawling?

– user1215269

2 Answers


The best way to generate URLs dynamically is to override the start_requests method of the spider:

from scrapy.http import Request

def start_requests(self):
    # Open in text mode (a bytes URL fails in Python 3) and strip the newline.
    with open('urls.txt') as urls:
        for url in urls:
            yield Request(url.strip(), callback=self.parse)
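
For example, a complete spider built around this pattern might look like the following sketch (the spider name, the urls.txt path, and the parse body are illustrative assumptions):

import scrapy

class UrlFileSpider(scrapy.Spider):
    name = 'url_file_spider'  # hypothetical name

    def start_requests(self):
        # Schedule one request per line of urls.txt.
        with open('urls.txt') as urls:
            for url in urls:
                yield scrapy.Request(url.strip(), callback=self.parse)

    def parse(self, response):
        # Placeholder: extract items from the page here.
        self.logger.info('Visited %s', response.url)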
– juraseg

There are two questions:

1) Yes, you can realize this functionality by using Rules, e.g.:

rules = (
    Rule(SgmlLinkExtractor(allow=(r'node_\d+\.htm',)), callback='parse_item'),
)

Note the trailing comma (rules must be a tuple) and the escaped dot in the regex. Also, don't name the callback 'parse': CrawlSpider uses the parse method internally, so overriding it breaks the rules.

suggested reading
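
Putting part 1 together with the URL patterns from the question, a full CrawlSpider might look like the sketch below. It uses LinkExtractor, the modern replacement for the deprecated SgmlLinkExtractor; the class name and callback name are illustrative:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class NodeSpider(CrawlSpider):
    name = 'node_spider'  # hypothetical name
    allowed_domains = ['www.a.com']
    start_urls = ['http://www.a.com/list_1_2_3.htm']

    rules = (
        # Follow list pages such as list_1_2_3.htm.
        Rule(LinkExtractor(allow=(r'list_\d+_\d+_\d+\.htm',)), follow=True),
        # Extract items from node pages such as node_42.htm.
        Rule(LinkExtractor(allow=(r'node_\d+\.htm',)), callback='parse_item'),
    )

    def parse_item(self, response):
        # Placeholder item extraction.
        yield {'url': response.url}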

2) Yes, you can generate start_urls dynamically; start_urls is just a list, e.g.:

>>> start_urls = ['http://www.a.com/%d_%d_%d' % (n, n+1, n+2) for n in range(0, 26)]

>>> start_urls

['http://www.a.com/0_1_2', 'http://www.a.com/1_2_3', 'http://www.a.com/2_3_4', 'http://www.a.com/3_4_5', 'http://www.a.com/4_5_6', 'http://www.a.com/5_6_7',  'http://www.a.com/6_7_8', 'http://www.a.com/7_8_9', 'http://www.a.com/8_9_10','http://www.a.com/9_10_11', 'http://www.a.com/10_11_12', 'http://www.a.com/11_12_13', 'http://www.a.com/12_13_14', 'http://www.a.com/13_14_15', 'http://www.a.com/14_15_16', 'http://www.a.com/15_16_17', 'http://www.a.com/16_17_18', 'http://www.a.com/17_18_19', 'http://www.a.com/18_19_20', 'http://www.a.com/19_20_21', 'http://www.a.com/20_21_22', 'http://www.a.com/21_22_23', 'http://www.a.com/22_23_24', 'http://www.a.com/23_24_25', 'http://www.a.com/24_25_26', 'http://www.a.com/25_26_27']
– akhter wahab
  • Thank you for the answers. But I want to generate the start_urls during crawling: when I meet a URL like 'http://www.a.com/%d_%d_%d', I add it to start_urls. I cannot know the range of start_urls up front... – user1215269 Feb 17 '12 at 09:54
  • As far as I know, Scrapy adds the start_urls requests to the scheduler at the start of crawling; if you add a URL to the start_urls list during crawling, it will not be executed (see the sketch after these comments for the alternative of yielding new requests from a callback). – akhter wahab Feb 17 '12 at 10:55
  • For some reason, (2) only works in the interpreter shell and fails with `runspider`. – not2qubit Sep 25 '18 at 22:09
  • How does Python know it's supposed to be %d? What if you used %h or %x? Why %d? – Maciek Semik Mar 06 '19 at 00:10
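
The scheduler behavior described in the comments points at the idiomatic fix for the asker's follow-up: instead of appending to start_urls mid-crawl, yield new Request objects from a callback whenever another list page is discovered. A minimal sketch using current Scrapy APIs (the spider name and link-matching logic are assumptions):

import re
import scrapy

class FollowListSpider(scrapy.Spider):
    name = 'follow_list_spider'  # hypothetical name
    start_urls = ['http://www.a.com/list_1_2_3.htm']

    def parse(self, response):
        # Yield a new request for every list page found on this page;
        # Scrapy schedules these just like the start URLs.
        for href in response.css('a::attr(href)').getall():
            if re.search(r'list_\d+_\d+_\d+\.htm', href):
                yield response.follow(href, callback=self.parse)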