
I'm using CrawlSpider and have a rule defined, but after the start_url the spider goes to the last page instead of the second page. Why does this happen, and how do I write a rule so the pages are followed in the correct order 2, 3, 4, ... etc.?

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor


class MySpider(CrawlSpider):
    name = "spidername"
    allowed_domains = ["example.com"]
    start_urls = [
        "http://www.example.com/some-start-url.html",
    ]

    rules = (
        # Extract pagination links from the page and follow them
        Rule(SgmlLinkExtractor(allow=(r'/Page-\d+\.html',)), callback='parse_links', follow=True),
    )

    def parse_links(self, response):
        # placeholder for the item extraction logic
        pass

The targeted site has a slightly strange pagination, but the defined rule finds all existing pages.

Goran
  • possible duplicate of [Scrapy not crawling subsequent pages in order](http://stackoverflow.com/questions/11049088/scrapy-not-crawling-subsequent-pages-in-order) – Talvalin Feb 09 '14 at 20:55
  • Hard to tell with a dummy URL, but your page may have a `Last Page` link that scrapy is following, instead of a `Next Page` link. Is there one? Can you share part of the HTML and the "strange pagination"? And does scrapy stop after crawling the last page, or keeps going? – Robin Feb 09 '14 at 23:54
  • Scrapy crawls the start page, then goes to the 4th page and crawls it, then the 3rd, and it stops after finishing the second page. Here is the page (start_url) which I'm trying to crawl: http://www.klikoglasi.com/oglasi/auto-moto/putnicka-vozila.html – Goran Feb 10 '14 at 00:11

3 Answers


From the Scrapy FAQ:

By default, Scrapy uses a LIFO queue for storing pending requests, which basically means that it crawls in DFO order. This order is more convenient in most cases. If you do want to crawl in true BFO order, you can do it by setting the following settings:

DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = 'scrapy.squeue.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeue.FifoMemoryQueue'
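
For reference, a minimal sketch of how these might sit in a project's settings.py, with a comment on what each one does (the queue paths are quoted from the FAQ of this Scrapy era; newer releases spell the module scrapy.squeues):

# settings.py -- crawl breadth-first (BFO) instead of the default depth-first
DEPTH_PRIORITY = 1                                          # deprioritize deeper requests
SCHEDULER_DISK_QUEUE = 'scrapy.squeue.PickleFifoDiskQueue'  # FIFO disk queue instead of LIFO
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeue.FifoMemoryQueue'    # FIFO memory queue instead of LIFO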
R. Max

Scrapy's SGML link extractor passes links through a Python set() for uniqueness (see Scrapy's utils unique function), so there is no control over ordering in the current implementation. It is also worth noting that even if ordering were implemented (by inheriting from the SGML extractor), there is no guarantee that the order of requests will match the order of responses: since the calls are asynchronous, some requests may take longer than others, so their responses arrive later.

If ordering is absolutely necessary, the only way to ensure it is to make the calls serially. One way to do this is to carry the remaining URLs in the request meta and issue the next request only upon receiving a response, as in the sketch below, but that really makes Twisted's parallelism useless.
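
For illustration, a minimal sketch of that serial approach; the spider name, the URL pattern and the parse_page callback are made-up placeholders, and the imports assume a Scrapy version from this era:

from scrapy.spider import Spider
from scrapy.http import Request


class SerialSpider(Spider):
    name = "serialspider"
    allowed_domains = ["example.com"]

    def start_requests(self):
        # build the full list of page URLs in the desired order (placeholder pattern)
        urls = ["http://www.example.com/Page-%d.html" % i for i in range(1, 6)]
        # issue only the first request; the remaining URLs travel along in meta
        yield Request(urls[0], callback=self.parse_page,
                      meta={'pending_urls': urls[1:]})

    def parse_page(self, response):
        # ... extract items from this page here ...

        # only after this response is handled, schedule the next URL,
        # so pages are fetched strictly one after another, in order
        pending = response.meta.get('pending_urls', [])
        if pending:
            yield Request(pending[0], callback=self.parse_page,
                          meta={'pending_urls': pending[1:]})

As the answer notes, this throws away the parallelism Twisted provides, so it only makes sense when the order genuinely matters.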

Guy Gavriely
  • Thanks for the answer. I see that the problem can't be fixed, because that is how the crawling process works. I'll try to create some other programming logic and order the items after they are downloaded. – Goran Feb 10 '14 at 17:59
  • you're welcome :) it might be worth including **why** order is important; also, maybe Scrapy is not the best fit in your case, I guess you already know `lxml` can work perfectly without it... – Guy Gavriely Feb 10 '14 at 18:02

It's late, but for future reference:

CONCURRENT_REQUESTS = 1

It will process the requests one by one, so it will keep the order too.
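
A minimal sketch of where this might go, assuming a standard project layout:

# settings.py -- keep only one request in flight at a time
CONCURRENT_REQUESTS = 1

Note that pending requests still come off a LIFO queue by default (see the FAQ quote in the first answer), so the BFO settings shown there may still be needed if strict page order matters.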

Tahir Shahzad