
Simply put, I need three things:

1) Log in
2) Make multiple requests
3) Make those requests synchronously (sequentially, like in C)

I realized that yield should be used to make multiple requests, but yield doesn't behave like a sequential call in C, so the requests aren't processed in order. So I tried building the Request without yield, as shown below, but the crawl method is never called. How can I call the crawl method sequentially, like in C?

import scrapy
from scrapy.http import FormRequest, Request


class HotdaySpider(scrapy.Spider):

    name = "hotday"
    allowed_domains = ["test.com"]
    login_page = "http://www.test.com"
    start_urls = ["http://www.test.com"]

    maxnum = 27982
    runcnt = 10

    def parse(self, response):
        return [FormRequest.from_response(response,
                                          formname='login_form',
                                          formdata={'id': 'id', 'password': 'password'},
                                          callback=self.after_login)]

    def after_login(self, response):
        global maxnum
        global runcnt
        i = 0

        while i < runcnt:
            # This Request is only constructed, never returned or yielded,
            # so Scrapy never schedules it and crawl() is never called.
            Request(url="http://www.test.com/view.php?idx=" + str(maxnum) + "/", callback=self.crawl)
            i = i + 1

    def crawl(self, response):
        global maxnum
        filename = 'hotday.html'

        with open(filename, 'wb') as f:
            f.write(unicode(response.body.decode(response.encoding)).encode('utf-8'))
        maxnum = maxnum + 1
kevink

1 Answer


When you return a list of requests (which is effectively what you do when you yield several of them), Scrapy schedules all of them at once, and you can't control the order in which the responses will come back.
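For illustration, here is a minimal sketch of what yielding several requests at once looks like (the URL pattern and the idx range are placeholders taken from the question, and parse_page is just a hypothetical callback name); Scrapy may deliver these responses in any order:

def after_login(self, response):
    # All the requests are handed to the scheduler immediately;
    # parse_page() is called as responses arrive, in no guaranteed order.
    for idx in range(self.maxnum, self.maxnum + self.runcnt):
        yield Request(
            url="http://www.test.com/view.php?idx=%d/" % idx,
            callback=self.parse_page,
        )

def parse_page(self, response):
    self.logger.info("got %s", response.url)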

If you want to process one response at a time and in order, you would have to return only one request in your after_login method and construct the next request in your crawl method.

import re

def after_login(self, response):
    return Request(url="http://www.test.com/view.php?idx=0/", callback=self.crawl)

def crawl(self, response):
    filename = 'hotday.html'

    with open(filename, 'wb') as f:
        f.write(unicode(response.body.decode(response.encoding)).encode('utf-8'))
    # Track progress on the spider instance (maxnum and runcnt are class
    # attributes, so module-level globals would not be defined here).
    self.maxnum = self.maxnum + 1

    # Extract the idx of the page just processed and request the next one.
    next_page = int(re.search(r'\?idx=(\d*)', response.request.url).group(1)) + 1
    if next_page < self.runcnt:
        return Request(url="http://www.test.com/view.php?idx=" + str(next_page) + "/", callback=self.crawl)
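As a variation, the remaining count can also be carried along in the request meta instead of spider attributes, so each page in the chain knows how many are left. A minimal sketch, assuming the same URL pattern as the question; the 'remaining' meta key and the per-idx filename are illustrative choices, not part of the original code:

def after_login(self, response):
    # Start the chain with the first page; carry the counter in meta.
    return Request(url="http://www.test.com/view.php?idx=%d/" % self.maxnum,
                   callback=self.crawl,
                   meta={'remaining': self.runcnt})

def crawl(self, response):
    # Save the page, then decide whether to request the next idx.
    idx = int(re.search(r'\?idx=(\d*)', response.request.url).group(1))
    with open('hotday_%d.html' % idx, 'wb') as f:
        f.write(response.body)

    remaining = response.meta['remaining'] - 1
    if remaining > 0:
        return Request(url="http://www.test.com/view.php?idx=%d/" % (idx + 1),
                       callback=self.crawl,
                       meta={'remaining': remaining})

Because only one request is ever in flight, the pages are fetched strictly one after another, which is the closest Scrapy gets to the sequential, C-like flow the question asks for.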
lufte