0

I have this code:

from urlparse import urljoin

from scrapy import log
from scrapy.http import Request
from scrapy.exceptions import CloseSpider
from scrapy.selector import HtmlXPathSelector
from bs4 import BeautifulSoup

def parse(self, response):

    hxs = HtmlXPathSelector(response)
    sites = hxs.select('//div[@class="headline_area"]')

    # Yield an item request for each of the first five headlines.
    for ivar, site in enumerate(sites[:5]):
        item = StackItem()
        log.msg('LOOP ' + str(ivar), level=log.ERROR)
        item['title'] = "yoo ma"
        request = Request("blabla", callback=self.test1)
        request.meta['item'] = item
        yield request

    # Follow the "next page" link, closing the spider once the page
    # number embedded in the link exceeds 500.
    soup = BeautifulSoup(response.body)
    mylinks = soup.find_all('a')
    if mylinks:
        nextlink = mylinks[0].get('href')
        page_number = nextlink.split("&")[-3].split("=")[-1]
        if int(page_number) > 500:
            raise CloseSpider('Search exceeded 500 pages')
        request = Request(urljoin(response.url, nextlink), callback=self.parse)
        request.meta['page'] = page_number
        yield request

Now my problem is this: suppose I want to stop at page_number = 5.

Scrapy reaches that page before all the items from page 1, page 2, etc. have been downloaded, and it stops as soon as it first gets there.

How can I get rid of that problem, so that it processes all the links before going to page = 5?

user19140477031
  • 363
  • 1
  • 4
  • 13

2 Answers

0

Does the link have some regularity across pages? For example, if the 5th page's link is www.xxxx.net/nForum/#!article/Bet/447540?p=5, you can request the link with p=5 directly.
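
A minimal sketch of that idea, assuming the URL pattern from the example above (the spider's parse callback is taken from the question; everything else is a placeholder):

    from scrapy.http import Request

    # Request each page directly, since the URLs only differ in "p".
    def start_requests(self):
        base = "http://www.xxxx.net/nForum/#!article/Bet/447540?p=%d"
        for page in range(1, 6):  # pages 1 through 5
            yield Request(base % page, callback=self.parse)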

jinghli
  • 617
  • 4
  • 11
  • Yes, all are similar, with p = 1, then 2, then 5, and so on. Can you explain more about how I can do that? Any documentation links? – user19140477031 Jan 04 '13 at 04:51
  • Maybe I misunderstood your issue. You want to process all pages before the 5th page, not jump to page 5 directly, right? If so, this issue may be related to the value of `page_number = nextlink.split("&")[-3].split("=")[-1]`. You can print it for debugging. – jinghli Jan 04 '13 at 05:49
-1

You can use the `inline_requests` decorator.
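
A minimal sketch, assuming the scrapy-inline-requests package is installed; the spider name and URLs are placeholders. With the decorator, each yielded Request hands its Response back inline, so one page can be fully processed before the next one is requested:

    from inline_requests import inline_requests
    from scrapy.http import Request
    from scrapy.spider import BaseSpider

    class MySpider(BaseSpider):
        name = 'myspider'
        start_urls = ['http://www.example.com/?p=1']

        @inline_requests
        def parse(self, response):
            # process page 1 (the initial response) here ...
            for page in range(2, 6):
                # yielding a Request inside the decorated callback
                # returns its Response inline, keeping pages in order
                next_response = yield Request('http://www.example.com/?p=%d' % page)
                # process next_response here ...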

R. Max
  • 6,624
  • 1
  • 27
  • 34