
I am scraping a sequence of URLs. The code works, but Scrapy is not parsing the URLs in sequential order. E.g., although I am trying to parse url1, url2, ..., url100, Scrapy parses url2, url10, url1, etc.

It parses all the URLs, but when a specific URL does not exist (e.g. example.com/unit.aspx?b_id=10) Firefox shows me the result of my previous request. As I want to make sure that I don't have duplicates, I need to ensure that the loop parses the URLs sequentially and not "at will".

I tried "for n in range(1,101) and also a "while bID<100" the result is the same. (see below)

Thanks in advance!

def check_login_response(self, response):
    """Check the response returned by a login request to see if we are
    successfully logged in.
    """
    if "Welcome!" in response.body:
        self.log("Successfully logged in. Let's start crawling!")
        print "Successfully logged in. Let's start crawling!"
        # Now the crawling can begin..
        self.initialized()
        bID=0
        #for n in range(1,100,1):
        while bID<100:
            bID = bID + 1
            startURL = 'https://www.example.com/units.aspx?b_id=%d' % bID
            request = Request(url=startURL, dont_filter=True, callback=self.parse_add_tables, meta={'bID': bID, 'metaItems': []})
            # print self.metabID
            yield request #Request(url=startURL ,dont_filter=True,callback=self.parse2)
    else:
        self.log("Something went wrong, we couldn't log in....Bad times :(")
        # Something went wrong, we couldn't log in, so nothing happens.
Jmm

2 Answers


You can use the priority attribute on the Request object. Scrapy crawls in DFO (depth-first order) by default, but that does not guarantee the URLs are visited in the order they were yielded from your parse callback.
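For example, here is a minimal sketch based on the loop from the question; lower b_id values get a higher priority, so the scheduler dispatches them first:

from scrapy.http import Request

def check_login_response(self, response):
    """Yield the unit pages with a priority matching their order."""
    if "Welcome!" in response.body:
        self.initialized()
        for bID in range(1, 101):
            startURL = 'https://www.example.com/units.aspx?b_id=%d' % bID
            # higher priority values are scheduled earlier, so b_id=1
            # is dispatched before b_id=2, and so on
            yield Request(url=startURL, dont_filter=True,
                          callback=self.parse_add_tables,
                          priority=101 - bID,
                          meta={'bID': bID, 'metaItems': []})

Note that responses can still come back out of order while several requests are in flight; if you need strictly sequential downloads you can also set CONCURRENT_REQUESTS = 1 in settings.py.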

Instead of yielding all the Request objects up front, you can keep them in a list and pop the next one off as each response is parsed, until the list is empty.
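A minimal sketch of that idea (pending_requests here is just an illustrative name for a list of Request objects built elsewhere on the spider):

def parse_add_tables(self, response):
    # ... parse the current response here ...
    # issue the next request only after this response has been handled
    if self.pending_requests:
        yield self.pending_requests.pop()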

For more info, see:

Scrapy Crawl URLs in Order

Mirage
  • Thank you for your answer! I searched the index but I didn't find this post. I am new to Python and Scrapy, so I need to learn more about how to change default attributes. – Jmm Feb 06 '13 at 13:16

You could try something like this. I'm not sure if it's fit for purpose, since I haven't seen the rest of the spider code, but here you go:

# class attribute on the spider: the urls to be parsed, in reverse order
# (so we can easily pop the next one off the end of the list)
crawl_urls = ['https://www.example.com/units.aspx?b_id=%s' % n for n in xrange(100, 1, -1)]

def check_login_response(self, response):
    """Check the response returned by a login request to see if we are successfully logged in.
    """
    if "Welcome!" in response.body:
        self.log("Successfully logged in. Let's start crawling!")
        print "Successfully logged in. Let's start crawling!"
        # Now the crawling can begin..
        self.initialized()
        return Request(url='https://www.example.com/units.aspx?b_id=1', dont_filter=True, callback=self.parse_add_tables, meta={'bID': 1, 'metaItems': []})
    else:
        self.log("Something went wrong, we couldn't log in....Bad times :(")
        # Something went wrong, we couldn't log in, so nothing happens.

def parse_add_tables(self, response):
    items = []
    # parsing code here (populate items from the response)
    if self.crawl_urls:
        next_url = self.crawl_urls.pop()
        # take everything after '=' so multi-digit b_id values parse correctly
        next_bid = int(next_url.split('=')[-1])
        items.append(Request(url=next_url, dont_filter=True, callback=self.parse_add_tables, meta={'bID': next_bid, 'metaItems': []}))

    return items
Talvalin