Scrapy, bind start_url to a specific parse item so they can be parsed one after another

Question

Hello! I am trying to figure out how to bind a start URL to a specific parse_item method inside a CrawlSpider class.

Let's say I have more than one start url, two for the sake of simplicity.

So: start_urls = ["www.website1.com","www.website2.com"]

Now let's say I have two parse functions named parse_item1 and parse_item2.

I have already set parse_item1 to call back to parse_item2 and vice versa.

So they do run one after the other.

My problem is that I want to go through the start URLs in alternating order.

That is: example1, example2, example1, example2. Not: example1, example1, example2, example2, example2, example1.

I thought I could use two parse_item functions to do so, but now I have a problem.

Even though the two functions still call each other in order, the start URLs are not processed in order.

So my question is: is it possible, and if so how, to bind for example www.example1.com to parse_item1 and www.example2.com to parse_item2 so that they get called one after the other?

import time

from scrapy import Request
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

# TwolaircrawlerItem is defined in the project's own items module.


class juggler(CrawlSpider):

    name = "juggle"
    allowed_domains = ["example1.com", "example2.com"]
    start_urls = ["http://www.example1.com/", "http://www.example2.com/"]
    rules = [
        Rule(LinkExtractor(), callback="parse_all", follow=False)
    ]

    def parse_all(self, response):
        yield self.parse_item1(response)
        yield self.parse_item2(response)

    def parse_item1(self, response):
        time.sleep(1)
        item = TwolaircrawlerItem()
        print("Item 1!")
        link = response.url
        print(link)
        # Request expects a callable here, not a string.
        return Request(url=link, callback=self.parse_item2)

    def parse_item2(self, response):
        time.sleep(1)
        item = TwolaircrawlerItem()
        print("Item 2!")
        link = response.url
        print(link)
        # Request expects a callable here, not a string.
        return Request(url=link, callback=self.parse_item1)

score 2 · Accepted Answer · edited May 23 '17 at 11:59

By default there is no guaranteed order; this is how Scrapy works.

If you need to process requests one by one in a strict order, you would need to maintain a queue of requests manually, as suggested here:

https://stackoverflow.com/a/11235898/771848
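
As a rough illustration of what maintaining a queue manually could look like, here is a minimal sketch of only the ordering logic, written in plain Python without Scrapy. The names `interleave` and `SequentialQueue` and the `/a` and `/b` paths are hypothetical, introduced just for this example; they are not part of Scrapy or of the linked answer.

```python
from collections import deque
from itertools import zip_longest


def interleave(*url_lists):
    """Return URLs in round-robin order: first of each list, then second of each, and so on."""
    ordered = []
    for group in zip_longest(*url_lists):
        # zip_longest pads shorter lists with None; skip the padding.
        ordered.extend(url for url in group if url is not None)
    return ordered


class SequentialQueue:
    """Hand out one URL at a time; the next one is released only when asked for."""

    def __init__(self, urls):
        self._pending = deque(urls)

    def next_url(self):
        # Return the next URL to request, or None once the queue is exhausted.
        return self._pending.popleft() if self._pending else None


# Hypothetical page URLs for the two sites from the question.
site1 = ["http://www.example1.com/a", "http://www.example1.com/b"]
site2 = ["http://www.example2.com/a", "http://www.example2.com/b"]
queue = SequentialQueue(interleave(site1, site2))

visited = []
url = queue.next_url()
while url is not None:
    visited.append(url)  # stands in for "the callback handled this response"
    url = queue.next_url()
```

In a spider, the same idea would presumably mean issuing only the first URL at start-up and having each callback yield a single Request for `queue.next_url()`, so that at most one request is in flight at any time.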

answered May 19 '16 at 20:49

alecxe
