So I want to scrape something like a list of articles, e.g. cnn.com. I'm currently using Scrapy's CrawlSpider to do so. However, I need the articles to be scraped in order. At the moment, the crawler crawls the 1st article in the list but then skips to the 31st, 16th, 24th, 9th, etc.

Is there any way to make the spider crawl links on the page in order (i.e., top to bottom, since recent articles appear at the top of the list)? I've looked around a little bit and found this, but unlike that post I don't want to crawl the start_urls in a certain order; I want to crawl the links of a start_url in order. Is this possible with Scrapy? I played around with a couple of things like DEPTH_PRIORITY, but I'm not sure that's what I'm looking for.

Any help would be greatly appreciated, thanks!!

ocean800
  • How do you get those links? And are they in order when you got them? – Bubble Bubble Bubble Gut Jun 16 '17 at 00:55
  • @Ding I crawl the individual articles where the `start_url` is the page that lists the articles. The `CrawlSpider` then crawls the individual article links... But not in order, which is my problem – ocean800 Jun 16 '17 at 01:00
  • Have you tried assigning [priority](https://doc.scrapy.org/en/latest/topics/request-response.html#request-objects) to `Request` objects? – Bubble Bubble Bubble Gut Jun 16 '17 at 01:07
  • @Ding But as the requests for each individual article are generated dynamically by the `CrawlSpider`, how can I assign a priority to the requests? Since I don't know the list of articles, how can I know which articles to prioritize? Not sure if I understood you properly – ocean800 Jun 16 '17 at 02:00
  • I'm pretty sure that restricting the follow path correctly would solve this; I would just create a regular spider. Also, I'm sure CNN has a public sitemap... an XML spider would be lighter, ergo much faster and less resource-hungry, and would get you the articles in order. – scriptso Jun 17 '17 at 16:23
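
To make the priority suggestion from the comments concrete, here is a minimal sketch. It assumes a plain `scrapy.Spider` rather than `CrawlSpider`, and that the article URLs have already been extracted in page order. The helper name `priorities_in_page_order`, the CSS selector, and the `parse_article` callback are all hypothetical and do not come from the question. Scrapy's scheduler dequeues requests with a higher `priority` first, so the first link on the page gets the largest value.

```python
def priorities_in_page_order(urls):
    """Pair each URL with a priority that decreases down the page.

    The first URL gets the highest number, because Scrapy schedules
    requests with a higher priority value earlier.
    """
    total = len(urls)
    return [(url, total - index) for index, url in enumerate(urls)]


# Hypothetical use inside a plain scrapy.Spider callback (not executed here):
#
#     def parse(self, response):
#         # Assumed selector; adjust it to the real listing page markup.
#         urls = response.css("a.article::attr(href)").getall()
#         for url, priority in priorities_in_page_order(urls):
#             yield scrapy.Request(
#                 response.urljoin(url),
#                 callback=self.parse_article,  # hypothetical callback
#                 priority=priority,
#             )
```

Priority only controls the order in which requests leave the scheduler. With several requests in flight, responses can still come back out of order, so setting `CONCURRENT_REQUESTS = 1` in the settings is probably also needed if strict top-to-bottom processing matters.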

0 Answers